incorrect batch_sizes when Dataloader returns a dict with multiple tensors. #3668

Closed
gerardsn opened this issue Sep 26, 2020 · 21 comments · Fixed by #3888

🐛 Bug

Tracked batch sizes in the result object are incorrect when a Dataloader returns a dict with multiple tensors.

To Reproduce

Create a data loader that returns a dict, e.g. batch = {'batchA': tensor_A, 'batchB': tensor_B}.
Both entries have batch size N with N != 2.
For this example a batch size of 2 will be logged, since len(batch) == 2 (see the sketch below).
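
A minimal sketch of the setup (names and sizes here are illustrative):

import torch
from torch.utils.data import DataLoader, Dataset

class DictDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return {'batchA': torch.randn(10), 'batchB': torch.randn(10)}

loader = DataLoader(DictDataset(), batch_size=8)
batch = next(iter(loader))
print(len(batch))            # 2  -> what gets tracked as the batch size
print(len(batch['batchA']))  # 8  -> the actual batch size N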

https://github.com/PyTorchLightning/pytorch-lightning/blob/05e5f03fd7c851b06ca5e34b39eb660857b8f00c/pytorch_lightning/trainer/evaluation_loop.py#L147-L150
https://github.com/PyTorchLightning/pytorch-lightning/blob/05e5f03fd7c851b06ca5e34b39eb660857b8f00c/pytorch_lightning/trainer/training_loop.py#L304-L306

Expected behavior

Log the correct batch size.
I'm not sure what counts as the 'correct' batch size when there are multiple tensors, but I expect each tensor in the dict to have the same batch size. So, maybe something like:

if is_result_obj:
    if isinstance(batch, dict):
        # use the first entry of the dict as a proxy for the whole batch
        batch = next(iter(batch.values()))
    result_obj.track_batch_size(len(batch))
@gerardsn gerardsn added bug Something isn't working help wanted Open to be worked on labels Sep 26, 2020
rohitgr7 (Contributor) commented Oct 1, 2020

I think doing just len(batch) is still wrong: if the batch is a tuple or some kind of custom batch datatype, len(batch) will give the wrong value. Even in the basic MNIST example it will give 2, which is wrong.
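
For instance, with an MNIST-style (X, y) batch (sizes here are arbitrary):

import torch

X = torch.randn(32, 1, 28, 28)   # a batch of 32 images
y = torch.randint(0, 10, (32,))  # their labels
batch = (X, y)
print(len(batch))                # 2 -- the tuple length, not the batch size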

gerardsn (Author) commented Oct 2, 2020

This should probably catch most things, though it might be a bit much.
It returns 1 if it fails to determine the batch size, to prevent issues with weighted averaging in reduce_on_epoch_end.

if is_result_obj:
    result_obj.track_batch_size(unpack_batchsize(batch))

# maybe add as a staticmethod to ResultObj?
import torch
from collections.abc import Iterable

def unpack_batchsize(sample):
    """
    Recursively unpack `sample` to find a torch.Tensor.
    Returns len(tensor) when one is found, or 1 when it hits an empty or
    non-iterable value.
    """
    if isinstance(sample, torch.Tensor):
        sample = len(sample)
    elif isinstance(sample, dict):
        # peek at the first value of the dict
        sample = next(iter(sample.values()), 1)
    elif isinstance(sample, Iterable) and not isinstance(sample, str):
        # peek at the first element of any other iterable (tuple, list, ...)
        sample = next(iter(sample), 1)
    else:
        # non-iterable (or a string): fall back to 1 so the weighted
        # average degrades to a plain mean
        sample = 1

    if isinstance(sample, int):
        return sample
    return unpack_batchsize(sample)
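
For example, with batches like those above this returns the actual batch size:

unpack_batchsize({'batchA': torch.randn(8, 10), 'batchB': torch.randn(8, 10)})  # 8
unpack_batchsize((torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))))     # 32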

carmocca (Contributor) commented Oct 2, 2020

I suggest adding a function to the LightningModule, batch_len_fx, which defaults to len if it is not overridden. Anything could be a batch, and Lightning shouldn't have the responsibility of supporting every batch type.
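
A rough sketch of how such a hook could be overridden (batch_len_fx is the hypothetical name proposed here, not an existing Lightning API):

import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def batch_len_fx(self, batch):
        # hypothetical hook: report the true batch size for a dict-of-tensors batch
        return len(next(iter(batch.values())))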

rohitgr7 (Contributor) commented Oct 2, 2020

Exactly what I had in mind @carmocca. Or maybe simply ask users to put batch_size in .log itself if on_epoch=True?

.log('some_metric', metric_value, on_epoch=True, batch_size=batch_size)
.log('some_metric', metric_value, on_epoch=False)

gerardsn (Author) commented Oct 2, 2020

Lightning currently defaults to weighted_mean for reduction on epoch end by substituting the reduction method if it is torch.mean:

https://github.com/PyTorchLightning/pytorch-lightning/blob/ebc1b23fa38d54e9805aa4356867369f064c7031/pytorch_lightning/core/step_result.py#L389-L390

If this is the desired behaviour, I think Lightning should at least attempt to get a reasonable estimate of the batch size. In most use cases the dataloader will return multiple tensors, resulting in an incorrect estimate if len is the default (e.g. any supervised method has at least (X, y) in its batch, producing len(batch) == 2, as mentioned by @rohitgr7).

This could still be done using batch_len_fx, though. On the first call, if the method is not overridden, replace batch_len_fx with a reasonable estimate based on the type of batch (e.g. len of [the tensor, the first value in the Iterable]).

gerardsn (Author) commented Oct 2, 2020

Exactly what I had in mind @carmocca. Or maybe simple ask to put batch_size in .log itself if on_epoch=True??

.log('some_metric', metric_value, on_epoch=True, batch_size=batch_size)
.log('some_metric', metric_value, on_epoch=False)

This should work too. It should probably default to 1 if not provided, since len is likely to be wrong.

fogside commented Oct 2, 2020

@gerardsn I have a problem exactly with this weighted_mean function.
I'm working with the latest Lightning version from master.

https://github.com/PyTorchLightning/pytorch-lightning/blob/ebc1b23fa38d54e9805aa4356867369f064c7031/pytorch_lightning/core/step_result.py#L369

It gets outputs = [{'checkpoint_on': tensor(28.3303, device='cuda:0'), 'val_loss': tensor(28.3303, device='cuda:0'), 'val_precision1': 0.12652068126520682}], because I have only one batch per epoch in validation.

Lightning tries to reduce on epoch end. It feeds result = tensor([27.8364], device='cuda:0'), weights=tensor([2]) into the weighted_mean function, and I get an error here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/ebc1b23fa38d54e9805aa4356867369f064c7031/pytorch_lightning/core/step_result.py#L897
AttributeError: 'list' object has no attribute 'device'
I think it's related to this issue. It would be nice to not reduce anything if it's just one batch per epoch.

@edenlightning edenlightning added this to the 0.9.x milestone Oct 2, 2020
@edenlightning edenlightning added the priority: 0 High priority task label Oct 2, 2020
@edenlightning edenlightning changed the title Result object incorrect batch_sizes incorrect batch_sizes when Dataloader returns a dict with multiple tensors. Oct 2, 2020
rohitgr7 (Contributor) commented Oct 2, 2020

@fogside in your example result is a tensor, so result.device should not throw an error.

fogside commented Oct 2, 2020

@fogside in your example result is a tensor, so result.device should not throw an error.

But it's a list with a tensor inside.

rohitgr7 (Contributor) commented Oct 2, 2020

result = tensor([27.8364], device='cuda:0'), weights=tensor([2])

You're referring to this, right?

fogside commented Oct 2, 2020

result = tensor([27.8364], device='cuda:0'), weights=tensor([2])

You're referring to this, right?

Sorry, I just realized that I was mistaken.
Actually, it calls this method twice for some reason.
I added prints at the beginning of the weighted_mean function and in reduce_on_epoch_end (I also changed the number of batches in this example):

Result[k] tensor([23.6331, 26.0617, 24.0941, 25.3255], device='cuda:0')
result:  tensor([23.6331, 26.0617, 24.0941, 25.3255], device='cuda:0')
weights:  tensor([2, 2, 2, 2])
Result[k] [0.14285714285714285, 0.06451612903225806, 0.056179775280898875, 0.13793103448275862]
result:  [0.14285714285714285, 0.06451612903225806, 0.056179775280898875, 0.13793103448275862]
weights:  tensor([2, 2, 2, 2])

And the second time it gives me the error.
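
For context, a simplified stand-in (not Lightning's exact code) for why the second call fails: the weights are a tensor, but a metric logged as a plain Python float comes back as a list, which has no .device attribute.

import torch

def weighted_mean(result, weights):
    # simplified stand-in for the reduction applied on epoch end
    weights = weights.to(result.device)  # AttributeError if result is a plain list
    return (result * weights.float()).sum() / weights.float().sum()

weighted_mean(torch.tensor([23.6331, 26.0617]), torch.tensor([2, 2]))  # works
# weighted_mean([0.1428, 0.0645], torch.tensor([2, 2]))                # AttributeError: 'list' object has no attribute 'device'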

rohitgr7 (Contributor) commented Oct 2, 2020

Are you logging non-tensor values? Maybe doing .item() somewhere in the logs? If not, can you put your .log statements here?

fogside commented Oct 2, 2020

Are you logging non-tensor values?

Yes, I was calculating precision in numpy. Isn't it possible to log non-tensor values?

rohitgr7 (Contributor) commented Oct 2, 2020

No. Also, to calculate precision or any other metric you can try the pl.metrics package, which computes these metrics on the current device itself.

Or you can just do torch.tensor(numpy_value) in .log.
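
For example (a hedged sketch; the metric name and precision_np are illustrative):

# inside validation_step, where precision_np is a numpy / Python float
self.log('val_precision1', torch.tensor(precision_np), on_epoch=True)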

fogside commented Oct 2, 2020

No. Also, to calculate precision or any other metric you can try the pl.metrics package, which computes these metrics on the current device itself.

I see. Thank you!
Actually, I was trying to work with pytorch-metric-learning and used a function for top-k precision estimation from there, but it looks quite tough to merge these two frameworks. I see now that top-k precision should be calculated in PyTorch. Another thing is that I need as large a batch as possible to get a good top-k estimate (it's even better to have the whole validation set), which is why I found it hard to do these estimates in validation_step. Maybe I should look into some callbacks?
But it's not related to this issue.

rohitgr7 (Contributor) commented Oct 2, 2020

Already working on top-k accuracy. Maybe we will add top-k precision and recall in pl.metrics as well. Can you point me to the implementation of top-k precision in the pytorch-metric-learning package? It would be helpful. Thanks :)

fogside commented Oct 3, 2020

Already working on top-k accuracy. Maybe we will add top-k precision and recall in pl.metrics as well. Can you point me to the implementation of top-k precision in the pytorch-metric-learning package? It would be helpful. Thanks :)

That's great!
Sure, I used the class AccuracyCalculator like this:

from pytorch_metric_learning.utils.accuracy_calculator import AccuracyCalculator

# note the trailing comma: include must be a tuple, not a plain string
accuracy_calculator = AccuracyCalculator(include=("mean_average_precision_at_r",), k=5)
# query and reference are the same set here, hence True
# (embeddings_come_from_same_source) as the last argument
accuracies = accuracy_calculator.get_accuracy(embeddings,
                                              embeddings,
                                              labels,
                                              labels,
                                              True)

Implementation:
https://github.com/KevinMusgrave/pytorch-metric-learning/blob/10bed5ee8719a543827aa32ea658603c2fcb0130/src/pytorch_metric_learning/utils/accuracy_calculator.py#L45

rohitgr7 (Contributor) commented Oct 3, 2020

So I guess 2 things should be fixed:

  • Track correct batch_size
  • Allow non-tensor numeric values in .log(...)

williamFalcon (Contributor) commented:

It feeds result = tensor([27.8364], device='cuda:0'), weights=tensor([2]) into the weighted_mean function and I get an error here: AttributeError: 'list' object has no attribute 'device' [...] It would be nice to not reduce anything if it's just one batch per epoch.

this is fixed on master

williamFalcon (Contributor) commented:

OK, making changes to this today.

What do we want as the default behavior? Doesn't the custom reduce function solve the problem of custom batches, etc.?

@rohitgr7
Copy link
Contributor

rohitgr7 commented Oct 5, 2020

The batch sizes are not tracked correctly.

@edenlightning edenlightning removed the help wanted Open to be worked on label Oct 5, 2020
williamFalcon added a commit that referenced this issue Oct 6, 2020
williamFalcon added a commit that referenced this issue Oct 6, 2020
williamFalcon added a commit that referenced this issue Oct 6, 2020
* Fixes #3668, #3887 as a bonus
