
Classification metrics consistency #655

Closed
mayeroa opened this issue Dec 5, 2021 · 5 comments · Fixed by #1195
Labels: enhancement (New feature or request), Important milestonish

mayeroa commented Dec 5, 2021

🚀 Feature

First of all, thank you very much for this awesome project. It helps a lot with evaluating deep learning models using standard, out-of-the-box metrics across several application domains.

My requests focus on classification metrics (which also apply to semantic segmentation). We may need to split this into several issues; do not hesitate to do so.

Motivation

I ran some tests to check whether all metrics taking the same inputs work as expected. Since semantic segmentation is a pixel-wise classification task, I have multiple scenarios in mind:

  • Binary classification
  • Multi-class classification
  • Binary semantic segmentation
  • Multi-class semantic segmentation

While running those tests (shown below), I noticed some inconsistencies between the different metrics (shape of inputs, handling of the binary scenario, parameter names, parameter default values, etc.). The idea would be to further standardize the interface of the classification metrics.

Pitch

Let us go through the different scenarios:

Binary classification

import torch
from torchmetrics import *


# Inputs
num_classes = 2
logits = torch.randn((10, num_classes))
targets = torch.randint(0, num_classes, (10, ))
probabilities = torch.softmax(logits, dim=1)

# Compute classification metrics
metrics = MetricCollection({
    'acc': Accuracy(num_classes=num_classes),
    'average_precision': AveragePrecision(num_classes=num_classes),
    'auroc': AUROC(num_classes=num_classes),
    'binned_average_precision': BinnedAveragePrecision(num_classes=num_classes, thresholds=5),
    'binned_precision_recall_curve': BinnedPrecisionRecallCurve(num_classes=num_classes, thresholds=5),
    'binned_recall_at_fixed_precision': BinnedRecallAtFixedPrecision(num_classes=num_classes, min_precision=0.5, thresholds=5),
    'calibration_error': CalibrationError(),
    'cohen_kappa': CohenKappa(num_classes=num_classes),
    'confusion_matrix': ConfusionMatrix(num_classes=num_classes),
    'f1': F1(num_classes=num_classes),
    'f2': FBeta(num_classes=num_classes, beta=2),
    'hamming_distance': HammingDistance(),
    'hinge': Hinge(),
    'iou': IoU(num_classes=num_classes),
    # 'kl_divergence': KLDivergence()
    'matthews_correlation_coef': MatthewsCorrcoef(num_classes=num_classes),
    'precision': Precision(num_classes=num_classes),
    'precision_recall_curve': PrecisionRecallCurve(num_classes=num_classes),
    'recall': Recall(num_classes=num_classes),
    'roc': ROC(num_classes=num_classes),
    'specificity': Specificity(num_classes=num_classes),
    'stat_scores': StatScores(num_classes=num_classes)
})
metrics.update(probabilities, targets)
metrics.compute()

Binary classification acts as expected, except for the KLDivergence metric, which complains with the following exception:

RuntimeError: Predictions and targets are expected to have the same shape
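
For reference, here is a minimal sketch of inputs that KLDivergence does accept (as far as I understand, it compares two probability distributions of identical shape), which would explain why the (10,) integer targets above are rejected:

import torch
from torchmetrics import KLDivergence


# KLDivergence compares two distributions of the same shape (N, C),
# so class-index targets of shape (N,) do not fit its interface
p = torch.softmax(torch.randn(10, 2), dim=1)
q = torch.softmax(torch.randn(10, 2), dim=1)

kl = KLDivergence()
kl.update(p, q)
kl.compute()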

I have several questions regarding the binary scenario:

  • What is considered best practice? Setting num_classes to 2? Setting it to 1, even though some metrics (ConfusionMatrix, IoU) complain about that?
  • Is num_classes=2 treated as binary or multi-class mode? I imagine it depends on the metric.
  • Some metrics accept num_classes=None for binary mode, others do not. It is not really clear to me how to properly handle binary classification across all these metrics.

Multi-class classification

This mode seems to work as expected and to be consistent, since num_classes is clearly defined here. Hence the same question: is computing the metrics with 2 classes equivalent to binary mode? If not, how should that mode be handled properly?

Code that works properly (except for KLDivergence, same issue as mentioned above):

import torch
from torchmetrics import *


# Inputs
num_classes = 5
logits = torch.randn((10, num_classes))
targets = torch.randint(0, num_classes, (10, ))
probabilities = torch.softmax(logits, dim=1)

# Compute classification metrics
metrics = MetricCollection({
    'acc': Accuracy(num_classes=num_classes),
    'average_precision': AveragePrecision(num_classes=num_classes),
    'auroc': AUROC(num_classes=num_classes),
    'binned_average_precision': BinnedAveragePrecision(num_classes=num_classes, thresholds=5),
    'binned_precision_recall_curve': BinnedPrecisionRecallCurve(num_classes=num_classes, thresholds=5),
    'binned_recall_at_fixed_precision': BinnedRecallAtFixedPrecision(num_classes=num_classes, min_precision=0.5, thresholds=5),
    'calibration_error': CalibrationError(),
    'cohen_kappa': CohenKappa(num_classes=num_classes),
    'confusion_matrix': ConfusionMatrix(num_classes=num_classes),
    'f1': F1(num_classes=num_classes),
    'f2': FBeta(num_classes=num_classes, beta=2),
    'hamming_distance': HammingDistance(),
    'hinge': Hinge(),
    'iou': IoU(num_classes=num_classes),
    # 'kl_divergence': KLDivergence(),
    'matthews_correlation_coef': MatthewsCorrcoef(num_classes=num_classes),
    'precision': Precision(num_classes=num_classes),
    'precision_recall_curve': PrecisionRecallCurve(num_classes=num_classes),
    'recall': Recall(num_classes=num_classes),
    'roc': ROC(num_classes=num_classes),
    'specificity': Specificity(num_classes=num_classes),
    'stat_scores': StatScores(num_classes=num_classes)
})
metrics.update(probabilities, targets)
metrics.compute()

Binary semantic segmentation

Since TorchMetrics supports extra dimensions for logits and targets, these classification metrics can also be used for semantic segmentation tasks. But for that type of task, I ran into some issues.

Let us take an example of 10 images of 32x32 pixels.

import torch
from torchmetrics import *


# Inputs
num_classes = 2
logits = torch.randn((10, num_classes, 32, 32))
targets = torch.randint(0, num_classes, (10, 32, 32))
probabilities = torch.softmax(logits, dim=1)

# Compute classification metrics
metrics = MetricCollection({
    'acc': Accuracy(num_classes=num_classes),
    'average_precision': AveragePrecision(num_classes=num_classes),
    'auroc': AUROC(num_classes=num_classes),
    # 'binned_average_precision': BinnedAveragePrecision(num_classes=num_classes, thresholds=5),
    # 'binned_precision_recall_curve': BinnedPrecisionRecallCurve(num_classes=num_classes, thresholds=5),
    # 'binned_recall_at_fixed_precision': BinnedRecallAtFixedPrecision(num_classes=num_classes, min_precision=0.5, thresholds=5),
    'calibration_error': CalibrationError(),
    'cohen_kappa': CohenKappa(num_classes=num_classes),
    'confusion_matrix': ConfusionMatrix(num_classes=num_classes),
    'f1': F1(num_classes=num_classes, mdmc_average='global'),
    'f2': FBeta(num_classes=num_classes, beta=2, mdmc_average='global'),
    'hamming_distance': HammingDistance(),
    # 'hinge': Hinge(),
    'iou': IoU(num_classes=num_classes),
    # 'kl_divergence': KLDivergence(),
    'matthews_correlation_coef': MatthewsCorrcoef(num_classes=num_classes),
    'precision': Precision(num_classes=num_classes, mdmc_average='global'),
    'precision_recall_curve': PrecisionRecallCurve(num_classes=num_classes),
    'recall': Recall(num_classes=num_classes, mdmc_average='global'),
    'roc': ROC(num_classes=num_classes),
    'specificity': Specificity(num_classes=num_classes, mdmc_average='global'),
    'stat_scores': StatScores(num_classes=num_classes, mdmc_reduce='global')
})
metrics.update(probabilities, targets)
metrics.compute()

  1. First of all, BinnedAveragePrecision, BinnedPrecisionRecallCurve and BinnedRecallAtFixedPrecision fail with the following exception (a flattening workaround is sketched after this list):

RuntimeError: The size of tensor a (2) must match the size of tensor b (32) at non-singleton dimension 2

This is despite the documentation (https://torchmetrics.readthedocs.io/en/latest/references/modules.html#binnedrecallatfixedprecision) stating that forward should accept this input format (logits = (N, C, ...) and targets = (N, ...)).

  2. Accuracy has a default value for mdmc_average of global, while other metrics (Precision, FBeta, Specificity, etc.) have it set to None. This needs to be consistent: either everything defaults to None or everything defaults to global. I would argue that defaulting to global would be ideal, since users could then use those metrics seamlessly for both classification and semantic segmentation tasks.

  3. I noticed that StatScores uses an mdmc_reduce parameter while the other metrics call it mdmc_average. It would be good to be consistent about this parameter name. The easiest change would be to settle on mdmc_average, since only StatScores would need to be updated.

  4. Once again, how should we properly deal with binary data? For the example, I used a 2-class workaround, but is there a better way of handling binary semantic segmentation tasks?
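
Regarding point 1, one possible workaround (a minimal sketch, assuming the binned metrics currently only support the documented (N, C) / (N,) format) is to flatten the spatial dimensions before updating:

import torch
from torchmetrics import BinnedAveragePrecision


# Inputs, as in the snippet above
num_classes = 2
logits = torch.randn((10, num_classes, 32, 32))
targets = torch.randint(0, num_classes, (10, 32, 32))
probabilities = torch.softmax(logits, dim=1)

# Flatten (N, C, H, W) -> (N*H*W, C) and (N, H, W) -> (N*H*W,) so the
# inputs match the documented format of the binned metrics
flat_probs = probabilities.permute(0, 2, 3, 1).reshape(-1, num_classes)
flat_targets = targets.reshape(-1)

binned_ap = BinnedAveragePrecision(num_classes=num_classes, thresholds=5)
binned_ap.update(flat_probs, flat_targets)
binned_ap.compute()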

Multi-class semantic segmentation

The scenario is the same as the previous one, with num_classes set to a value higher than 2.
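
For completeness, only the input shapes change compared to the previous snippet, for example:

import torch


# Inputs: same setup as the binary segmentation example, but with 5 classes
num_classes = 5
logits = torch.randn((10, num_classes, 32, 32))        # (N, C, H, W)
targets = torch.randint(0, num_classes, (10, 32, 32))  # (N, H, W)
probabilities = torch.softmax(logits, dim=1)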

Conclusion

This issue gathers different topics that may need to be treated separately, but they all come down to API consistency.
We can start a discussion about it, but my main question, despite this long text, is how to properly deal with binary classification and binary semantic segmentation:

  • Is using 2 classes a suitable workaround?
  • Or should we be compliant with the BCEWithLogitsLoss interface, for example, meaning that the probabilities and targets have the exact same shape (for instance (10, 32, 32) in my example)? A rough sketch of this option follows below.
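
To illustrate that second option, a purely hypothetical sketch (this is not the current API), where the foreground probability shares the shape of the targets:

import torch


num_classes = 2
logits = torch.randn((10, num_classes, 32, 32))
targets = torch.randint(0, num_classes, (10, 32, 32))

# Hypothetical BCEWithLogitsLoss-style inputs: the foreground probability
# and the targets have the exact same shape, (10, 32, 32)
foreground_probs = torch.softmax(logits, dim=1)[:, 1]
assert foreground_probs.shape == targets.shape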

I would like to have your thoughts about that. :)
And sorry for the long issue, but I wanted to be as clear as possible and provide meaningful examples.

Alternatives

Document in more detail how each metric should be used for binary tasks.

Additional context

The main idea behind all these requests is to make the metrics easier to use and to standardize their interfaces (shape of inputs, num_classes parameter, etc.). I understand this is a tough topic, but it matters.

@mayeroa mayeroa added the enhancement New feature or request label Dec 5, 2021

github-actions bot commented Dec 5, 2021

Hi! Thanks for your contribution, great first issue!


mayeroa commented Dec 6, 2021

An additional question: if I am correct, TorchMetrics currently handles these 4 scenarios, based on this check: https://github.com/PyTorchLightning/metrics/blob/6bcc8b0f7f52b370583aff4a9192d7a399f3da75/torchmetrics/utilities/checks.py#L73

# Get the case
if preds.ndim == 1 and preds_float:
    case = DataType.BINARY
elif preds.ndim == 1 and not preds_float:
    case = DataType.MULTICLASS
elif preds.ndim > 1 and preds_float:
    case = DataType.MULTILABEL
else:
    case = DataType.MULTIDIM_MULTICLASS

Would a MULTIDIM_BINARY case make sense to properly handle binary semantic segmentation?
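
Something like the following (purely illustrative, not the actual checks.py code; distinguishing this case from multi-label data would probably need more than a shape check) is what I have in mind:

# Purely illustrative sketch of where a MULTIDIM_BINARY case could slot in
# (DataType.MULTIDIM_BINARY is hypothetical and does not exist today)
if preds.ndim == 1 and preds_float:
    case = DataType.BINARY
elif preds.ndim == 1 and not preds_float:
    case = DataType.MULTICLASS
elif preds.ndim > 1 and preds_float and preds.shape == target.shape:
    case = DataType.MULTIDIM_BINARY  # e.g. per-pixel probabilities in segmentation
elif preds.ndim > 1 and preds_float:
    case = DataType.MULTILABEL
else:
    case = DataType.MULTIDIM_MULTICLASS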

SkafteNicki (Member) commented:

Hi @mayeroa, thanks for raising this issue.
You are completely right that there are some problems with the consistency of our classification metrics, which are becoming more and more clear as we gain more users.

The optimal case would be that all metrics support binary, multi-label and multi-class classification and that the user can have as many additional dimensions as they want for each metric. However, this is definitely not the case at the moment.

It is pretty clear to me that the whole classification package needs to be refactored, and IMO we need to add an argument called task (or something similar), provided by the user, which can be either binary, multi-label or multi-class. However, this is a huge undertaking and I am not sure when we will have the bandwidth to do it.


Borda commented Jan 6, 2022

cc: @aribornstein @ethanwharris

@Borda Borda added the Important milestonish label Jan 6, 2022
@Borda Borda added this to the v0.8 milestone Jan 6, 2022
@Borda Borda changed the title from "[Feature Request] Classification metrics consistency" to "Classification metrics consistency" Jan 26, 2022
@Borda Borda modified the milestones: v0.8, v0.9 Mar 22, 2022
@SkafteNicki SkafteNicki modified the milestones: v0.9, v0.10 May 12, 2022
SkafteNicki (Member) commented:

This issue will be fixed by the classification refactor: see issue #1001 and PR #1195 for all the changes.

Small recap: this issue describes a number of inconsistencies in the classification package, especially regarding how binary tasks are dealt with, both for single-dimensional and multi-dimensional input (as in segmentation). In the refactor, all metrics have been split into binary_*, multiclass_* and multilabel_* versions that all follow the same interface, making the metrics consistent. To be precise regarding this issue:

  • Binary classification -> use binary_* metrics
  • Multi-class classification -> use multiclass_* metrics
  • Binary semantic segmentation -> use binary_* metrics
  • Multi-class semantic segmentation -> use multiclass_* metrics
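
For example, roughly (a minimal sketch; the class names follow the refactored naming from the PR and may differ slightly in the released version):

import torch
from torchmetrics.classification import BinaryJaccardIndex, MulticlassJaccardIndex

# Binary semantic segmentation: per-pixel probabilities share the shape
# of the integer targets, no 2-class workaround needed
preds = torch.rand(10, 32, 32)
targets = torch.randint(0, 2, (10, 32, 32))
BinaryJaccardIndex()(preds, targets)

# Multi-class semantic segmentation: predictions carry an explicit class dimension
num_classes = 5
mc_preds = torch.softmax(torch.randn(10, num_classes, 32, 32), dim=1)
mc_targets = torch.randint(0, num_classes, (10, 32, 32))
MulticlassJaccardIndex(num_classes=num_classes)(mc_preds, mc_targets)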

This issue will be closed when #1195 is merged.
