DDP validation: All gather for flattened 1D tensors taking long time to complete #55

krishansubudhi · 2021-08-10T13:46:55Z

Task = POS tagging

    def val_step(self, global_step: int, batch, device="cpu", encoder = None, encoder_kwargs={}):
        """
        Can return multiple outputs. First output need not be loss.
        """
        ...
        print(rels_predicted.shape)
        return label_loss, pointer_loss, rels_predicted, rels_labels

validation ptb_dep 3:: 0%| | 0/7 [00:00<?, ?it/s]torch.Size([1541])
torch.Size([1547])
torch.Size([1500])
torch.Size([1514])
torch.Size([1570])
torch.Size([1506])
torch.Size([1477])
torch.Size([1626])
validation ptb_dep 2:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 30.67it/s]
gathering
validation ptb_dep 3:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 29.47it/s]
gathering
validation ptb_dep 1:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 28.46it/s]
gathering
validation ptb_dep 0:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 27.57it/s]
gathering

krishansubudhi · 2021-08-10T14:04:51Z

After more debugging, this looks like a bug in gather_tensors_on_cpu:

    def gather_tensors_on_cpu(self, x: torch.tensor):
        n_samples = len(x)
        self._set_gather_frequency(n_samples)
        gathered = []
        n_chunks = n_samples // self.gather_frequency + 1
        print(n_chunks, n_samples, self.gather_frequency)  # debug print added

Output

1510 3018 2

The bug is caused by self._set_gather_frequency(n_samples).
With multiple outputs, if dimension 0 of the first output is 2, gather_frequency is set to 2 and then reused for the rest of the outputs. In the output above, a tensor with 3018 samples is therefore split into 1510 chunks. Assigning this value to a class variable needs to be avoided here.
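The numbers in the debug output follow directly from the chunk formula. The sketch below reproduces that arithmetic and shows one possible way to avoid the stale value, assuming the chunk size is derived from the current tensor on every call instead of being stored. `chunk_size_for` and `max_chunk_elems` are hypothetical names chosen for illustration; they are not part of the library.

```python
def n_chunks_for(n_samples: int, gather_frequency: int) -> int:
    """Chunk count using the same formula as gather_tensors_on_cpu."""
    return n_samples // gather_frequency + 1


def chunk_size_for(n_samples: int, max_chunk_elems: int = 1024) -> int:
    """Hypothetical helper: derive the chunk size from the current tensor only.

    max_chunk_elems is an assumed upper bound, not a value from the library.
    """
    return max(1, min(n_samples, max_chunk_elems))


# Stale value left over from an earlier output whose dimension 0 was 2:
stale = n_chunks_for(3018, 2)
print(stale)  # 1510, matching the debug output

# Value derived per call for the 3018-element tensor:
fresh = n_chunks_for(3018, chunk_size_for(3018))
print(fresh)  # 3
```

With a per-call value, the 3018-element tensor needs a handful of collective calls instead of 1510.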

krishansubudhi · 2021-08-10T14:16:01Z

Also, if n_chunks differs between two processes, all-gather hangs, because the process with the higher number of chunks keeps waiting for the others. To fix this, either set n_chunks = 1 or gather n_chunks from all processes first and use the maximum.
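The second option can be sketched without a distributed setup. The example below simulates four ranks in a single process, using four of the tensor sizes printed in the first comment, and pads every rank to the largest chunk count so that each one would issue the same number of gather calls. The `chunk_size` of 512 and the helper names are assumptions for illustration only. In a real DDP run the maximum would have to come from a collective such as `torch.distributed.all_reduce` with `ReduceOp.MAX`, not from a local `max`.

```python
from typing import List


def n_chunks_for(n_samples: int, chunk_size: int) -> int:
    """Ceiling division: chunks needed to cover n_samples (at least 1)."""
    return max(1, -(-n_samples // chunk_size))


def split_padded(values: List[int], chunk_size: int, total_chunks: int) -> List[List[int]]:
    """Split values into exactly total_chunks chunks, padding with empty ones."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    chunks += [[] for _ in range(total_chunks - len(chunks))]
    return chunks


# Simulated per-rank output lengths (sizes from the printed shapes).
rank_sizes = [1541, 1547, 1500, 1514]
chunk_size = 512  # assumed value for illustration

local_counts = [n_chunks_for(n, chunk_size) for n in rank_sizes]

# Stand-in for the collective: every rank must agree on this number
# before the gather loop starts.
agreed = max(local_counts)

per_rank_chunks = [split_padded(list(range(n)), chunk_size, agreed) for n in rank_sizes]
print(local_counts, agreed)  # [4, 4, 3, 3] 4
```

Without the agreed count, the ranks with 3 chunks would stop after their third gather call while the ranks with 4 chunks wait for a fourth, which matches the hang described above.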

jsleep · 2021-08-10T22:46:16Z

Thanks for debugging this @krishansubudhi - any chance you'd be able to put these changes in or should we try to resource for next sprint?

krishansubudhi · 2021-08-26T16:59:41Z

I am working on the fix and will raise a PR soon.

jsleep added the bug label Aug 10, 2021

jsleep linked a pull request Oct 11, 2021 that will close this issue

ddp backend fix and documentation changes #68

Open

jsleep assigned krishansubudhi Oct 11, 2021
