Proper way to log things when using DDP #6501
-
         Hi, I was wondering what is the proper way of logging metrics when using DDP. I noticed that if I want to print something inside validation_epoch_end it gets printed once per process when running on multiple GPUs. I understand that I can solve the printing by checking self.global_rank == 0 and only printing on that rank. Here is a code snippet from my use case. I would like to be able to report f1, precision and recall on the entire validation dataset and I am wondering what is the correct way of doing it when using DDP.

    def _process_epoch_outputs(self,
                               outputs: List[Dict[str, Any]]
                               ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Creates and returns tensors containing all labels and predictions
        Goes over the outputs accumulated from every batch, detaches the
        necessary tensors and stacks them together.
        Args:
            outputs (List[Dict])
        """
        all_labels = []
        all_predictions = []
        for output in outputs:
            for labels in output['labels'].detach():
                all_labels.append(labels)
            for predictions in output['predictions'].detach():
                all_predictions.append(predictions)
        all_labels = torch.stack(all_labels).long().cpu()
        all_predictions = torch.stack(all_predictions).cpu()
        return all_predictions, all_labels

    def validation_epoch_end(self, outputs: List[Dict[str, Any]]) -> None:
        """Logs f1, precision and recall on the validation set."""
        if self.global_rank == 0:
            print(f'Validation Epoch: {self.current_epoch}')
        predictions, labels = self._process_epoch_outputs(outputs)
        for i, name in enumerate(self.label_columns):
            f1, prec, recall, t = metrics.get_f1_prec_recall(predictions[:, i],
                                                             labels[:, i],
                                                             threshold=None)
            self.logger.experiment.add_scalar(f'{name}_f1/Val',
                                              f1,
                                              self.current_epoch)
            self.logger.experiment.add_scalar(f'{name}_Precision/Val',
                                              prec,
                                              self.current_epoch)
            self.logger.experiment.add_scalar(f'{name}_Recall/Val',
                                              recall,
                                              self.current_epoch)
            if self.global_rank == 0:
                print((f'F1: {f1}, Precision: {prec}, '
                       f'Recall: {recall}, Threshold {t}'))
-
         I have the same question, and have not been able to get sufficient clarity from the docs about how logging works during distributed training. I found the suggestion to use the sync_dist flag when logging, but the docs do not explain what it actually does.
-
         I have the same problem. I managed to log the synced metric by calling metric.compute(), but the value is not identical to the one the checkpoint callback uses. Details can be found in #6352, and there is a related issue as well.
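A minimal sketch of the two logging patterns being compared here, not taken from the thread: it assumes torchmetrics >= 0.11 (BinaryF1Score from torchmetrics.classification) and a LightningModule that creates self.val_f1 = BinaryF1Score() in __init__, with the usual imports and a forward defined as in the snippet above. Logging the metric object leaves the synced compute() and the reset to Lightning, while calling compute() manually returns the synced value but makes you responsible for the logging key and the reset, which is one place where a mismatch with the checkpoint callback can creep in.

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = torch.sigmoid(self(x))
        # Option A: update the metric and log the metric object itself.
        # Lightning runs the cross-process-synced compute() at epoch end and
        # resets the state, so a ModelCheckpoint monitoring 'val_f1' sees the
        # same synced value that ends up in the logger.
        self.val_f1(preds, y.int())
        self.log('val_f1', self.val_f1, on_step=False, on_epoch=True)

    # Option B: call compute() yourself. compute() synchronizes the metric
    # state across processes, but the logging key and the reset() are then
    # your responsibility, and the key the callback monitors must match
    # what you log.
    # def validation_epoch_end(self, outputs):
    #     self.log('val_f1', self.val_f1.compute())
    #     self.val_f1.reset()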
-
         @williamFalcon any chance someone can help us with this?
-
         @edenafek could you please take a look at the above issue?
-
         Hi all,
 Sorry we have not got back to you in time, let me try to answer some of your questions:
 Is validation_epoch_end only called on rank 0? No, it is called by all processes.
 What does the sync_dist flag do? Here is the essential code:
 https://github.com/PyTorchLightning/pytorch-lightning/blob/a72a7992a283f2eb5183d129a8cf6466903f1dc8/pytorch_lightning/core/step_result.py#L108-L115
 If sync_dist=True then it will by default call the sync_ddp function, which will sum the value across all processes using torch.distributed.all_reduce:
 https://github.com/PyTorchLightning/pytorch-lightning/blob/a72a7992a283f2eb5183d129a8cf6466903f1dc8/pytorch_lightning/utilities/distributed.py#L120
 Use this …
 For the printing, recommended is using either the rank_zero_info utility (from pytorch_lightning.utilities import rank_zero_info) or the rank_zero_only decorator.
 Our own metrics have custom synchronization going on. Any metric will automatically synchronize between different processes whenever metric.compute() is called.
 Not sure this answers all questions.
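Pulling those answers together for the original snippet, here is a rough sketch rather than an official recommendation: it reuses the question's own helpers (_process_epoch_outputs, label_columns and metrics.get_f1_prec_recall are assumed to exist on the module), prints once via rank_zero_info, and logs with sync_dist=True. Keep in mind that sync_dist reduces the already-computed per-rank scalars; it does not recompute f1/precision/recall over the pooled predictions of all ranks, so for exact dataset-level numbers a torchmetrics metric that syncs its state on compute() may be the cleaner route.

    # Rough sketch: the validation_epoch_end from the question, rewritten with
    # rank_zero_info and sync_dist=True. The helpers it calls are the ones
    # defined in the question and are assumed to be available on the module.
    from pytorch_lightning.utilities import rank_zero_info

    def validation_epoch_end(self, outputs):
        # rank_zero_info only emits on global rank 0, so this prints once even
        # though every process runs validation_epoch_end.
        rank_zero_info(f'Validation Epoch: {self.current_epoch}')

        predictions, labels = self._process_epoch_outputs(outputs)
        for i, name in enumerate(self.label_columns):
            f1, prec, recall, t = metrics.get_f1_prec_recall(predictions[:, i],
                                                             labels[:, i],
                                                             threshold=None)

            # Move the scalars onto the module's device so the distributed
            # backend (e.g. NCCL) can all_reduce them when sync_dist=True.
            f1, prec, recall = (torch.as_tensor(v, device=self.device)
                                for v in (f1, prec, recall))

            # sync_dist=True reduces each value across processes before it is
            # logged, so the logger no longer records a single GPU's number.
            # Note: this combines per-rank scalars; it does not recompute the
            # metrics over the predictions pooled from every rank.
            self.log(f'{name}_f1/Val', f1, sync_dist=True)
            self.log(f'{name}_Precision/Val', prec, sync_dist=True)
            self.log(f'{name}_Recall/Val', recall, sync_dist=True)

            rank_zero_info(f'F1: {f1}, Precision: {prec}, '
                           f'Recall: {recall}, Threshold {t}')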