Classification metrics consistency #655
Hi! Thanks for your contribution, great first issue!
Additional question, if I am correct:

```python
# Get the case
if preds.ndim == 1 and preds_float:
    case = DataType.BINARY
elif preds.ndim == 1 and not preds_float:
    case = DataType.MULTICLASS
elif preds.ndim > 1 and preds_float:
    case = DataType.MULTILABEL
else:
    case = DataType.MULTIDIM_MULTICLASS
```

Would …
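For illustration, a standalone sketch of how that branching maps prediction tensors to cases; `infer_case` is a hypothetical helper and the string labels stand in for the `DataType` enum members:

```python
import torch

def infer_case(preds: torch.Tensor) -> str:
    """Hypothetical helper mirroring the branching above (strings stand in for DataType)."""
    preds_float = preds.is_floating_point()
    if preds.ndim == 1 and preds_float:
        return "BINARY"               # e.g. probabilities of shape (N,)
    if preds.ndim == 1 and not preds_float:
        return "MULTICLASS"           # e.g. integer class labels of shape (N,)
    if preds.ndim > 1 and preds_float:
        return "MULTILABEL"           # e.g. probabilities of shape (N, C) or (N, C, ...)
    return "MULTIDIM_MULTICLASS"      # e.g. integer labels of shape (N, ...)

print(infer_case(torch.rand(10)))                      # BINARY
print(infer_case(torch.randint(0, 5, (10,))))          # MULTICLASS
print(infer_case(torch.rand(10, 5)))                   # MULTILABEL
print(infer_case(torch.randint(0, 5, (10, 32, 32))))   # MULTIDIM_MULTICLASS
```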
Hi @mayeroa, thanks for raising this issue. The optimal case would be that all metrics support … I think it is pretty clear to me that the whole classification package needs to be refactored, and IMO we need to add an argument called …
Issue will be fixed by the classification refactor: see issue #1001 and PR #1195 for all changes. Small recap: this issue describes a number of inconsistencies in the classification package, especially regarding how binary tasks are dealt with, both for single-dimensional input and for multi-dimensional input (like in segmentation). In the refactor, all metrics have been split into a binary, a multiclass and a multilabel version.
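For reference, a rough sketch of what the post-refactor interface looks like (assuming torchmetrics >= 0.11, where each task has its own class):

```python
import torch
from torchmetrics.classification import BinaryAccuracy, MulticlassAccuracy

# Binary segmentation: preds and target share the same (N, H, W) shape.
preds = torch.rand(10, 32, 32)                 # per-pixel probability of the positive class
target = torch.randint(0, 2, (10, 32, 32))
print(BinaryAccuracy()(preds, target))

# Multi-class segmentation: (N, C, ...) logits vs. (N, ...) integer masks.
logits = torch.randn(10, 5, 32, 32)
labels = torch.randint(0, 5, (10, 32, 32))
print(MulticlassAccuracy(num_classes=5)(logits, labels))
```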
Issue will be closed when #1195 is merged.
🚀 Feature
First of all, thank you very much for this awesome project. It helps a lot in evaluating deep learning models with standard, out-of-the-box metrics across several application domains.
My request(s) will be more focused on classification metrics (also applied to semantic segmentation). Maybe we will need to split this into several issues; do not hesitate.
Motivation
I have made some tests to check whether all metrics taking the same inputs work as expected, with multiple scenarios in mind, since semantic segmentation is a pixel-wise classification task.
Doing those tests (mentioned below), I noticed some inconsistencies between the different metrics (shape of inputs, handling of the binary scenario, parameter names, parameter default values, etc.). The idea would be to further standardize the interface of each classification metric.
Pitch
Let us go through the different scenarios:
Binary classification
Binary classification acts as expected, except for the `KLDivergence` metric, which complains with the following exception: RuntimeError: Predictions and targets are expected to have the same shape
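For concreteness, a minimal sketch of the binary setup I am testing (torchmetrics 0.x-style API; exact behaviour may vary between versions):

```python
import torch
from torchmetrics import Accuracy, ConfusionMatrix

preds = torch.rand(10)               # probabilities for the positive class, shape (N,)
target = torch.randint(0, 2, (10,))  # binary targets in {0, 1}, shape (N,)

print(Accuracy()(preds, target))                      # works without num_classes
print(ConfusionMatrix(num_classes=2)(preds, target))  # but needs num_classes=2 here

# KLDivergence, by contrast, compares two (N, C) probability distributions of
# identical shape, which is where the RuntimeError quoted above comes from.
```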
I have several questions regarding the binary scenario:

- Should `num_classes` be set to 2? Setting it to 1 makes some metrics complain about it (`ConfusionMatrix`, `IoU`).
- Is `num_classes` set to 2 considered as `binary` or `multi-class` mode? I imagine it depends on the metric.
- Some metrics accept a `None` value of `num_classes` for binary mode, some others don't.

It is not really clear, to my mind, how to properly handle binary classification with all those metrics.

Multi-class classification
This mode seems to work as expected and to be consistent, since `num_classes` is clearly defined here. Hence the same question: is computing the metrics with 2 classes equivalent to binary mode? If not, how should that mode be handled properly?

Code for this scenario works properly (except for `KLDivergence`, same issue as mentioned above); a representative sketch is shown below.
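A representative sketch of the multi-class setup (5 classes; torchmetrics 0.x-style API, values are illustrative):

```python
import torch
from torchmetrics import Accuracy, ConfusionMatrix, Precision

num_classes = 5
preds = torch.randn(10, num_classes).softmax(dim=-1)  # probabilities, shape (N, C)
target = torch.randint(0, num_classes, (10,))         # integer labels, shape (N,)

for metric in (
    Accuracy(),
    Precision(num_classes=num_classes, average="macro"),
    ConfusionMatrix(num_classes=num_classes),
):
    print(type(metric).__name__, metric(preds, target))
```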
Binary semantic segmentation

Since TorchMetrics supports extra dimensions for logits and targets, these classification metrics may also be used for semantic segmentation tasks. But for that type of task, I ended up with some issues. Let us take the example of 10 images of 32x32 pixels.

`BinnedAveragePrecision`, `BinnedPrecisionRecallCurve` and `BinnedRecallAtFixedPrecision` are failing with the following exception: RuntimeError: The size of tensor a (2) must match the size of tensor b (32) at non-singleton dimension 2
This happens even though it is stated in the documentation (https://torchmetrics.readthedocs.io/en/latest/references/modules.html#binnedrecallatfixedprecision) that `forward` should accept that type of format (logits = `(N, C, ...)` and targets = `(N, ...)`). A minimal sketch of the setup is shown below.
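This is a hedged sketch of that setup (again torchmetrics 0.x-style API; it uses the 2-class workaround discussed further down):

```python
import torch
from torchmetrics import Accuracy, BinnedAveragePrecision

logits = torch.randn(10, 2, 32, 32).softmax(dim=1)  # (N, C, H, W) probabilities
target = torch.randint(0, 2, (10, 32, 32))          # (N, H, W) integer masks

print(Accuracy()(logits, target))  # extra dimensions are accepted here

# BinnedAveragePrecision is documented to accept (N, C, ...) / (N, ...) inputs,
# but with these shapes it raised the RuntimeError quoted above in my tests:
ap = BinnedAveragePrecision(num_classes=2, thresholds=5)
# ap(logits, target)  # -> RuntimeError: The size of tensor a (2) must match ...
```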
`Accuracy` has a default value for `mdmc_average` set to `global`, while other metrics (`Precision`, `FBeta`, `Specificity`, etc.) have it set to `None` -> we need to be consistent on this: either everything is set to `None` or everything to `global`. I would argue that setting the default to `global` would be ideal, since the user would then be able to use those metrics for classification or semantic segmentation tasks seamlessly.
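To illustrate the current defaults (a hedged sketch, 0.x-style API):

```python
import torch
from torchmetrics import Accuracy, Precision

logits = torch.randn(10, 5, 32, 32).softmax(dim=1)  # (N, C, H, W)
target = torch.randint(0, 5, (10, 32, 32))          # (N, H, W)

# Accuracy accepts the multi-dimensional input out of the box (mdmc_average="global").
print(Accuracy()(logits, target))

# Precision defaults mdmc_average to None, so it has to be set explicitly for the same input.
precision = Precision(num_classes=5, average="macro", mdmc_average="global")
print(precision(logits, target))
```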
I also notice that `StatScores` uses an `mdmc_reduce` parameter while the other metrics call it `mdmc_average`. I think it would be suitable to be consistent about the name of this parameter. The easiest change would be to definitively adopt `mdmc_average` (since only `StatScores` would need to be updated).
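A short illustration of the naming mismatch (hedged, 0.x-style API):

```python
from torchmetrics import Precision, StatScores

# The same concept is spelled differently across metrics:
Precision(num_classes=5, mdmc_average="global")   # most metrics: mdmc_average
StatScores(num_classes=5, mdmc_reduce="global")   # StatScores: mdmc_reduce
```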
Once again, how should we properly deal with binary data? For the example, I use a 2-class workaround, but is there a better way of handling binary semantic segmentation tasks?

Multi-class semantic segmentation
The scenario is the same as the previous one, while setting `num_classes` to a value higher than 2.

Conclusion
This issue gathers different topics that may need to be treated separately but all concern API consistency.
We can start a discussion about it, but my main question, regarding that long text, is how to properly deal with binary classification and semantic segmentation. One option would be to follow the `BCELossWithLogits` interface, for example, meaning having the probabilities and targets share the exact same shape (for instance `(10, 32, 32)` in my example); see the sketch below.
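As a sketch of what I mean (hypothetical, just mirroring how `torch.nn.functional.binary_cross_entropy_with_logits` handles shapes):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(10, 32, 32)                     # per-pixel logits, no channel axis
target = torch.randint(0, 2, (10, 32, 32)).float()   # per-pixel binary targets, same shape

# Predictions and targets share the exact same shape, which is the kind of
# interface that would feel natural for binary (segmentation) metrics as well.
print(F.binary_cross_entropy_with_logits(logits, target))
```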
I would like to have your thoughts about that. :)
And sorry for the long issue, but I wanted to be as clear as possible, providing meaningful examples.
Alternatives
Document more thoroughly how each metric should be used in binary tasks.

Additional context
The main idea of all those requests is to ease the use of those metrics and standardize their interfaces (shape of inputs, `num_classes` parameter, etc.). I understand that this is a tough topic, but it matters.