Using pytorch-lightning to train PixelCL on multi-gpu #11

Open · 11 comments
ahmed-bensaad commented Feb 1, 2021

Hello everyone,

I'm trying to use pytorch-lightning to train PixelCL on 2 GPUs using the ddp2 accelerator.

I followed this example:

import torch
import pytorch_lightning as pl
from torch.optim.lr_scheduler import OneCycleLR
from pixel_level_contrastive_learning import PixelCL
from lars import LARS, get_lr  # assumed local helpers; the traceback below points at swav_src/lars.py

class SelfSupervisedLearner(pl.LightningModule):
    def __init__(self, net, n_epochs, steps_per_epoch, **kwargs):
        super().__init__()
        self.learner = PixelCL(net, **kwargs)
        self.n_epochs = n_epochs
        self.steps_per_epoch = steps_per_epoch

    def forward(self, images):
        return self.learner(images)

    def training_step(self, batch, _):
        images, _ = batch
        loss, _ = self.forward(images)
        self.log('loss',loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return {'loss': loss}

    def configure_optimizers(self):
        opt = LARS(self.parameters(), lr = 1e-3, weight_decay = 1e-3)  # alternatively: torch.optim.Adam(self.parameters(), lr = 1e-3)
        scheduler = OneCycleLR(opt, max_lr = get_lr(opt), epochs = self.n_epochs, steps_per_epoch = self.steps_per_epoch)
        return [opt], [scheduler]


    def on_before_zero_grad(self, _):
        self.learner.update_moving_average()

learner = SelfSupervisedLearner(
    resnet,
    n_epochs = 200,
    steps_per_epoch = len(train_loader),
    image_size = 244,
    hidden_layer = 'layer4',        # leads to output of 8x8 feature map
    projection_size = 256,          # size of projection output, 256 was used in the paper
    projection_hidden_size = 2048,  # size of projection hidden dimension, paper used 2048
    moving_average_decay = 0.99,    # exponential moving average decay of target encoder
    ppm_num_layers = 1,             # number of layers for transform function in the pixel propagation module, 1 was optimal
    ppm_gamma = 2,                  # sharpness of the similarity in the pixel propagation module, already at optimal value of 2
    distance_thres = 0.7,           # ideal value is 0.7, as indicated in the paper, which makes the assumption of each feature map's pixel diagonal distance to be …
    similarity_temperature = 0.3,   # temperature for the cosine similarity for the pixel contrastive loss
    alpha = 1.                      # weight of the pixel propagation loss (pixpro) vs pixel CL loss
).cuda()


trainer = pl.Trainer(
    gpus = 2,
    max_epochs = 200,
    accumulate_grad_batches = 1,
    sync_batchnorm = False,
    callbacks = [ckpt_callback],  # ckpt_callback defined elsewhere in the script
    accelerator = 'ddp2'
)

trainer.fit(learner, train_loader)
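
One caveat with the configure_optimizers above: Lightning steps LR schedulers once per epoch by default, while a OneCycleLR built with steps_per_epoch is meant to step once per batch. A minimal sketch of the fix, using Lightning's scheduler-dict format:

# inside SelfSupervisedLearner
def configure_optimizers(self):
    opt = LARS(self.parameters(), lr = 1e-3, weight_decay = 1e-3)
    scheduler = OneCycleLR(opt, max_lr = get_lr(opt), epochs = self.n_epochs,
                           steps_per_epoch = self.steps_per_epoch)
    # 'interval': 'step' asks Lightning to call scheduler.step() after every batch
    return [opt], [{'scheduler': scheduler, 'interval': 'step'}]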

When I try to run the code above, I get the following errors (two tracebacks from the same run):

0: Traceback (most recent call last):
0:   File "../swav_src/light_seg_byol.py", line 166, in <module>
0:     trainer.fit(learner, train_loader)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
0:     results = self.accelerator_backend.train()
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp2_accelerator.py", line 64, in train
0:     return self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp2_accelerator.py", line 202, in ddp_train
0:     results = self.train_or_test()
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
0:     results = self.trainer.train()
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
0:     self.train_loop.run_training_epoch()
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 549, in run_training_epoch
0:     batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
0:     self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 490, in optimizer_step
0:     using_lbfgs=is_lbfgs,
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1296, in optimizer_step
0:     optimizer.step(closure=optimizer_closure)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 286, in step
0:     self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 144, in __optimizer_step
0:     optimizer.step(closure=closure, *args, **kwargs)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
0:     return wrapped(*args, **kwargs)
0:   File "/gpfs/users/bensaad/namr/swav_src/lars.py", line 83, in step
0:     loss = closure()
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 699, in train_step_and_backward_closure
0:     self.trainer.hiddens
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 792, in training_step_and_backward
0:     result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 316, in training_step
0:     training_step_output = self.trainer.accelerator_backend.training_step(args)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp2_accelerator.py", line 67, in training_step
0:     return self._step(args)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp2_accelerator.py", line 81, in _step
0:     output = self.trainer.model(*args)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
0:     result = self.forward(*input, **kwargs)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 188, in forward
0:     outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 161, in parallel_apply
0:     return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 321, in parallel_apply
0:     raise output
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 274, in _worker
0:     output = module.training_step(*input, **kwargs)
0:   File "../swav_src/light_seg_byol.py", line 66, in training_step
0:     loss, _ = self.forward(images)
0:   File "../swav_src/light_seg_byol.py", line 62, in forward
0:     return self.learner(images)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
0:     result = self.forward(*input, **kwargs)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pixel_level_contrastive_learning/pixel_level_contrastive_learning.py", line 294, in forward
0:     proj_one = self.online_encoder(image_one_cutout)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
0:     result = self.forward(*input, **kwargs)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pixel_level_contrastive_learning/pixel_level_contrastive_learning.py", line 201, in forward
0:     projection = projector(representation)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
0:     result = self.forward(*input, **kwargs)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pixel_level_contrastive_learning/pixel_level_contrastive_learning.py", line 107, in forward
0:     return self.net(x)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
0:     result = self.forward(*input, **kwargs)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
0:     input = module(input)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
0:     result = self.forward(*input, **kwargs)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 353, in forward
0:     return self._conv_forward(input, self.weight)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 350, in _conv_forward
0:     self.padding, self.dilation, self.groups)
0: RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
0: Traceback (most recent call last):
0:   [... identical frames to the traceback above, down to the online_encoder call ...]
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pixel_level_contrastive_learning/pixel_level_contrastive_learning.py", line 199, in forward
0:     representation = self.get_representation(x)
0:   File "/gpfs/users/bensaad/.local/lib/python3.7/site-packages/pixel_level_contrastive_learning/pixel_level_contrastive_learning.py", line 195, in get_representation
0:     assert hidden is not None, f'hidden layer {self.layer} never emitted an output'
0: AssertionError: hidden layer layer4 never emitted an output

According to this issue, something needs to be done to register the forward hook, but I cannot understand what it is.
Could someone help me, please?

Thanks
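
For background on the hook problem: PixelCL captures the hidden layer's output with a forward hook, and a hook registered in __init__ does not survive the per-GPU module replication that ddp2 (DataParallel-style) performs, so the replicas never populate the hidden buffer. A minimal sketch of the lazy-registration idea (the NetWrapper name and plumbing are illustrative, not the library's exact code):

from torch import nn

class NetWrapper(nn.Module):
    def __init__(self, net, layer = 'layer4'):
        super().__init__()
        self.net = net
        self.layer = layer
        self.hidden = None
        self.hook_registered = False

    def _hook(self, module, inputs, output):
        # stash the hidden layer's output as a side effect of the forward pass
        self.hidden = output

    def _register_hook(self):
        target = dict(self.net.named_modules())[self.layer]
        target.register_forward_hook(self._hook)
        self.hook_registered = True

    def forward(self, x):
        if not self.hook_registered:
            self._register_hook()  # lazy: registers on the replica actually running forward
        self.hidden = None
        _ = self.net(x)
        assert self.hidden is not None, f'hidden layer {self.layer} never emitted an output'
        return self.hidden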

@lucidrains (Owner)

@ahmed-bensaad should be fixed in 0.1.0! eda34f7

@lucidrains (Owner)

@ahmed-bensaad you should also have sync batchnorm turned on, as brought to light by this paper https://arxiv.org/abs/2101.07525
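
For reference, sync batchnorm can be enabled either through the Trainer flag or by converting the model directly; a minimal sketch of both (standard Lightning and PyTorch APIs):

# Lightning: one flag on the Trainer (requires one GPU per process, i.e. plain ddp)
trainer = pl.Trainer(gpus = 2, accelerator = 'ddp', sync_batchnorm = True)

# equivalently, convert the BatchNorm layers by hand before distributed wrapping
resnet = torch.nn.SyncBatchNorm.convert_sync_batchnorm(resnet)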

@ahmed-bensaad (Author)

@lucidrains Thank you for your response. Indeed, this error has been fixed, but now I get a new one:

0: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

For (1): according to this issue, it is not yet possible to set find_unused_parameters=False when using PyTorch Lightning (see the plain-DDP sketch below).
For (2): I think this can be dealt with in the forward function of the model:

    def forward(self, x):
        shape, device, prob_flip = x.shape, x.device, self.prob_rand_hflip

        rand_flip_fn = lambda t: torch.flip(t, dims = (-1,))
        ...
        return loss, positive_pixel_pairs
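
For reference on (1): outside Lightning, the flag goes directly to the DDP constructor. A minimal sketch in plain PyTorch, assuming a process group is already initialized and local_rank is known:

from torch.nn.parallel import DistributedDataParallel as DDP

model = model.to(local_rank)
ddp_model = DDP(
    model,
    device_ids = [local_rank],
    find_unused_parameters = True,  # tolerate parameters that receive no gradient in a step
)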

> @ahmed-bensaad you should also have sync batchnorm turned on, as brought to light by this paper https://arxiv.org/abs/2101.07525

Sync batchnorm doesn't seem to work with the ddp2 accelerator:

0: ValueError: SyncBatchNorm is only supported for DDP with single GPU per process 
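
The constraint comes from SyncBatchNorm itself, which requires one process per GPU; switching the accelerator from ddp2 to ddp satisfies that. A minimal sketch of the Trainer change:

trainer = pl.Trainer(
    gpus = 2,
    max_epochs = 200,
    accelerator = 'ddp',     # one process per GPU, so SyncBatchNorm is allowed
    sync_batchnorm = True,
)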

@lucidrains (Owner)

Sure! I can add that! Why did you close the PR lol

@ahmed-bensaad (Author) commented Feb 2, 2021

I closed the issue because the workaround proposed here works fine for me. It is mainly a PyTorch Lightning issue, not related to this repository.

@lucidrains (Owner)

@ahmed-bensaad ahh, thanks for that, but I want to make this work for everyone lol

https://github.com/lucidrains/pixel-level-contrastive-learning/releases/tag/0.1.1 should be good now!

@lucidrains (Owner)

@ahmed-bensaad would you be willing to share your script with a pull request after it works? :D

@ahmed-bensaad (Author)

Of course I will. With another (very) minor change to the package.

@lucidrains (Owner)

did you ever get this to work?

@ahmed-bensaad (Author) commented Feb 5, 2021 via email

@lucidrains (Owner)

woohoo!
