Releases: Lightning-AI/torchmetrics
Minor patch release
[1.3.1] - 2024-02-12
Fixed
- Fixed how backprop is handled in `LPIPS` metric (#2326)
- Fixed `MultitaskWrapper` not being able to be logged in Lightning when using metric collections (#2349)
- Fixed high memory consumption in `Perplexity` metric (#2346)
- Fixed cached network in `FeatureShare` not being moved to the correct device (#2348)
- Fixed naming of statistics in `MeanAveragePrecision` with custom max det thresholds (#2367)
- Fixed custom aggregation in retrieval metrics (#2364)
- Fixed initialization of aggregation metrics with default floating type (#2366)
- Fixed plotting of confusion matrices (#2358)
Full Changelog: v1.3.0...v1.3.1
Key Contributors
@Borda, @fschlatt, @JonasVerbickas, @nsmlzl, @SkafteNicki
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release
Full Changelog: v1.3.0...v1.3.0.post0
New Image metrics & wrappers
TorchMetrics v1.3 is out now! This release introduces seven new metrics across the different subdomains of TorchMetrics and adds some nice features to already established metrics. In this blog post, we present the new metrics with short code samples.
We are happy to see the continued adoption of TorchMetrics in over 19,000 GitHub projects, and we are proud to report that we have passed 1,800 GitHub stars.
New metrics
The retrieval domain has received one new metric in this release: `RetrievalAUROC`. This metric calculates the Area Under the Receiver Operating Characteristic (ROC) curve for document retrieval data. It is similar to the standard `AUROC` metric from classification but also supports the additional `indexes` argument that all retrieval metrics support.
from torch import tensor
from torchmetrics.retrieval import RetrievalAUROC
indexes = tensor([0, 0, 0, 1, 1, 1, 1])
preds = tensor([0.2, 0.3, 0.5, 0.1, 0.3, 0.5, 0.2])
target = tensor([False, False, True, False, True, False, True])
r_auroc = RetrievalAUROC()
r_auroc(preds, target, indexes=indexes)
# tensor(0.7500)
The image subdomain is receiving two new metrics in v1.3, which brings the total number of image-specific metrics in TorchMetrics to 21! As with other metrics, these two new metrics work by comparing a predicted image tensor to a ground truth image, but they focus on different properties for their metric calculation.
- The first metric is `SpatialCorrelationCoefficient`. As the name indicates, this metric focuses on how well the spatial structure of the predicted image correlates with the ground truth image.

import torch
torch.manual_seed(42)
from torchmetrics.image import SpatialCorrelationCoefficient as SCC
preds = torch.randn([32, 3, 64, 64])
target = torch.randn([32, 3, 64, 64])
scc = SCC()
scc(preds, target)
# tensor(0.0023)
- The second metric is `SpatialDistortionIndex`, which compares the spatial structure of the images and is especially useful for evaluating multispectral images.

import torch
from torchmetrics.image import SpatialDistortionIndex
preds = torch.rand([16, 3, 32, 32])
target = {
    'ms': torch.rand([16, 3, 16, 16]),
    'pan': torch.rand([16, 3, 32, 32]),
}
sdi = SpatialDistortionIndex()
sdi(preds, target)
# tensor(0.0090)
A new wrapper metric called `FeatureShare` has also been added. This can be seen as a specialized version of `MetricCollection` that can be combined with metrics that use a neural network as part of their metric calculation. For example, `FrechetInceptionDistance`, `InceptionScore`, and `KernelInceptionDistance` all, by default, use an inception network for their metric calculations. When these metrics were combined inside a `MetricCollection`, the underlying neural network was still called three times, which is quite redundant and wastes resources. In principle, it should be possible to call the network only once and then propagate the value to all metrics, which is exactly what the `FeatureShare` wrapper does.
import torch
from torchmetrics.wrappers import FeatureShare
from torchmetrics import MetricCollection
from torchmetrics.image import FrechetInceptionDistance, KernelInceptionDistance

def fs_wrapper():
    fs = FeatureShare([FrechetInceptionDistance(), KernelInceptionDistance(subset_size=10, subsets=2)])
    fs.update(torch.randint(255, (50, 3, 64, 64), dtype=torch.uint8), real=True)
    fs.update(torch.randint(255, (50, 3, 64, 64), dtype=torch.uint8), real=False)
    fs.compute()

def mc_wrapper():
    mc = MetricCollection([FrechetInceptionDistance(), KernelInceptionDistance(subset_size=10, subsets=2)])
    mc.update(torch.randint(255, (50, 3, 64, 64), dtype=torch.uint8), real=True)
    mc.update(torch.randint(255, (50, 3, 64, 64), dtype=torch.uint8), real=False)
    mc.compute()

# let's compare (using IPython's %timeit magic)
%timeit fs_wrapper()
# 8.38 s ± 564 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mc_wrapper()
# 13.8 s ± 232 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As shown in the timings above, the `FeatureShare` wrapper is significantly faster than the equivalent `MetricCollection`.
Improved features
In v1.2, several new arguments were added to the `MeanAveragePrecision` metric from the detection package. This metric has seen a further small improvement: the argument `extended_summary=True` now also returns confidence scores. A confidence score is the score assigned by the model indicating how confident it is that a given predicted bounding box belongs to a certain class.
import torch
from torchmetrics.detection import MeanAveragePrecision

# enable extended summary
map_metric = MeanAveragePrecision(extended_summary=True)
preds = [
    {
        "boxes": torch.tensor([[0.5, 0.5, 1, 1]]),
        "scores": torch.tensor([1.0]),
        "labels": torch.tensor([0]),
    }
]
target = [
    {"boxes": torch.tensor([[0, 0, 1, 1]]), "labels": torch.tensor([0])}
]
map_metric.update(preds, target)
result = map_metric.compute()
# the new confidence scores can be found in the "scores" key
confidence_scores = result["scores"]
# in this case confidence_scores will have shape (10, 101, 1, 4, 3)
# because
# * we evaluate for 10 different IoU thresholds by default
# * we evaluate the PR-curve at 101 linearly spaced locations
# * we only have 1 class (see the labels tensor)
# * there are 4 area sizes we evaluate on (small, medium, large and all)
# * by default `max_detection_thresholds=[1, 10, 100]`, meaning we evaluate for 3 values
From v1.3, all retrieval metrics now support an argument called `aggregation` that determines how the metric should be aggregated over different documents. The supported options are `"mean"`, `"median"`, `"max"`, and `"min"`, with the default value being `"mean"`, which is fully backward compatible with earlier versions of TorchMetrics.
from torch import tensor
from torchmetrics.retrieval import RetrievalHitRate
indexes = tensor([0, 0, 0, 1, 1, 1, 1])
preds = tensor([0.2, 0.3, 0.5, 0.1, 0.3, 0.5, 0.2])
target = tensor([True, False, False, False, True, False, True])
hr2 = RetrievalHitRate(aggregation="max")
hr2(preds, target, indexes=indexes)
# tensor(1.000)
Finally, the `SacreBLEU` metric from the text domain now supports even more tokenizers: `"ja-mecab"`, `"ko-mecab"`, `"flores101"`, and `"flores200"`.
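For example, the tokenizer is selected through the metric's `tokenize` argument. The Japanese/Korean tokenizers need the respective mecab packages and the flores tokenizers need sentencepiece installed, so the sketch below sticks to the default English tokenizer and only shows where the argument goes:

from torchmetrics.text import SacreBLEUScore

preds = ["the cat is on the mat"]
target = [["there is a cat on the mat", "a cat is on the mat"]]

# swap "13a" for e.g. "ja-mecab", "ko-mecab", "flores101" or "flores200"
# (extra dependencies required for those tokenizers)
sacre_bleu = SacreBLEUScore(tokenize="13a")
sacre_bleu(preds, target)
# corpus-level BLEU score as a scalar tensor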
Changes and bugfixes
Users should be aware that from v1.3, TorchMetrics only supports PyTorch v1.10 and up (previously v1.8). We always try to provide support for PyTorch releases for up to two years.
There have been several bug fixes related to numerical stability in a number of metrics. For this reason, we always recommend that users run the most recent version of TorchMetrics for the best experience.
Thank you!
As always, we offer a big thank you to all of our community members for their contributions and feedback. Please open an issue in the repo if you have any recommendations for the next metrics we should tackle.
If you want to ask a question or join us in expanding TorchMetrics, please join our Discord server, where you can ask questions and get guidance in the #torchmetrics channel.
🔥 Check out the documentation and code! 🚀
[1.3.0] - 2024-01-10
Added
- Added more tokenizers for `SacreBLEU` metric (#2068)
- Added support for logging `MultiTaskWrapper` directly with Lightning's `log_dict` method (#2213)
- Added `FeatureShare` wrapper to share submodules containing feature extractors between metrics (#2120)
- Added new metrics to image domain:
  - `SpatialCorrelationCoefficient`
  - `SpatialDistortionIndex`
- Added `average` argument to multiclass versions of `PrecisionRecallCurve` and `ROC` (#2084)
- Added confidence scores when `extended_summary=True` in `MeanAveragePrecision` (#2212)
- Added `RetrievalAUROC` metric (#2251)
- Added `aggregate` argument to retrieval metrics (#2220)
- Added utility functions in `segmentation.utils` for future segmentation metrics (#2105)
Changed
- Changed minimum supported PyTorch version from 1.8 to 1.10 (#2145)
- Changed x-/y-axis order for `PrecisionRecallCurve` to be consistent with scikit-learn (#2183)
Deprecated
- Deprecated `metric._update_called` (#2141)
- Deprecated `specicity_at_sensitivity` in favour of `specificity_at_sensitivity` (#2199)
Fixed
- Fixed support for half precision + CPU in metrics requiring topk operator (#2252)
- Fixed warning incorrectly being raised in `Running` metrics (#2256)
- Fixed integration with custom feature extractor in `FID` metric (#2277)
Full Changelog: v1.2.0...v1.3.0
Key Contributors
@Borda, @HoseinAkbarzadeh, @matsumotosan, @miskf...
Lazy imports
[1.2.1] - 2023-11-30
Added
- Added error if `NoTrainInceptionV3` is being initialized without `torch-fidelity` being installed (#2143)
- Added support for PyTorch `v2.1` (#2142)
Changed
- Changed default state of `SpectralAngleMapper` and `UniversalImageQualityIndex` to be tensors (#2089)
- Use `arange` and repeat for deterministic bincount (#2184)
Removed
- Removed unused `lpips` third-party package as dependency of `LearnedPerceptualImagePatchSimilarity` metric (#2230)
Fixed
- Fixed numerical stability bug in `LearnedPerceptualImagePatchSimilarity` metric (#2144)
- Fixed numerical stability issue in `UniversalImageQualityIndex` metric (#2222)
- Fixed incompatibility for `MeanAveragePrecision` with `pycocotools` backend when too few `max_detection_thresholds` are provided (#2219)
- Fixed support for half precision in `Perplexity` metric (#2235)
- Fixed device and dtype for `LearnedPerceptualImagePatchSimilarity` functional metric (#2234)
- Fixed bug in `Metric._reduce_states(...)` when using `dist_sync_fn="cat"` (#2226)
- Fixed bug in `CosineSimilarity` where 2d is expected but 1d input was given (#2241)
- Fixed bug in `MetricCollection` when using compute groups and `compute` is called more than once (#2211)
Full Changelog: v1.2.0...v1.2.1
Key Contributors
@Borda, @jankng, @kyle-dorman, @SkafteNicki, @tanguymagne
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Clustering metrics
TorchMetrics v1.2 is out now! The latest release includes 11 new metrics within a new subdomain: Clustering.
In this blog post, we briefly explain what clustering is, why measuring it is useful, and present the newly added metrics with code samples.
Clustering - what is it?
Clustering is an unsupervised learning technique. The term unsupervised here refers to the fact that we do not have ground truth targets as we do in classification. The primary goal of clustering is to discover hidden patterns or structures within data without prior knowledge about the meaning or importance of particular features. Thus, clustering is a form of data exploration, in contrast to supervised learning, where the goal is "just" to predict which class a data point belongs to.
The key goal of clustering algorithms is to split data into clusters/sets where data points from the same cluster are more similar to each other than any other points from the remaining clusters. Some of the most common and widely used clustering algorithms are K-Means, Hierarchical clustering, and Gaussian Mixture Models (GMM).
An objective quality evaluation/measure is required regardless of the clustering algorithm or internal optimization criterion used. In general, we can divide all clustering metrics into two categories: extrinsic metrics and intrinsic metrics.
Extrinsic metrics
Extrinsic metrics are characterized by requiring some ground truth labeling, even though they are used for an unsupervised method. This may seem counter-intuitive at first, as by the definition of clustering we do not use such ground truth labeling. However, most clustering algorithms are still developed on datasets with labels available, so these metrics use this fact to their advantage.
Intrinsic metrics
In contrast, intrinsic metrics do not need any ground truth information. These metrics estimate inter-cluster consistency (cohesion of all points assigned to a single set) compared to other clusters (separation). This is often done by comparing the distance in the embedding space.
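To make the distinction concrete, here is a minimal sketch using two of the newly added metrics (the data is made up, so the actual values are not meaningful): an extrinsic metric such as `RandScore` compares predicted cluster assignments against ground truth labels, while an intrinsic metric such as `CalinskiHarabaszScore` only needs the embeddings and the predicted assignments.

import torch
from torchmetrics.clustering import CalinskiHarabaszScore, RandScore

torch.manual_seed(42)
data = torch.randn(10, 5)                              # embeddings of 10 samples
preds = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])   # predicted cluster assignments
target = torch.tensor([0, 0, 1, 1, 2, 2, 1, 0, 2, 1])  # ground truth labels

# extrinsic: needs the ground truth labels
rand_score = RandScore()
rand_score(preds, target)

# intrinsic: only needs the embeddings and the predicted assignments
chs = CalinskiHarabaszScore()
chs(data, preds)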
Update to Mean Average Precision
`MeanAveragePrecision`, the most widely used metric for object detection in computer vision, now supports two new arguments: `average` and `backend`.
- The `average` argument controls averaging over multiple classes. By the core definition, the default is `macro` averaging, where the metric is calculated for each class separately and then averaged. This will continue to be the default in TorchMetrics, but we now also support the setting `average="micro"`. Under this setting, every object is essentially considered to be of the same class, and the returned value is therefore calculated simultaneously over all objects.
- The second argument, `backend`, is important, as it indicates which computational backend will be used for the internal computations. Since `MeanAveragePrecision` is not a simple metric to compute, and we value the correctness of our metrics, we rely on a third-party library for the internal computations. By default, we rely on users having the official pycocotools installed, but with the new argument we will also be supporting other backends (see the small example after this list).
[1.2.0] - 2023-09-22
Added
- Added metrics to cluster package:
  - `MutualInformationScore` (#2008)
  - `RandScore` (#2025)
  - `NormalizedMutualInfoScore` (#2029)
  - `AdjustedRandScore` (#2032)
  - `CalinskiHarabaszScore` (#2036)
  - `DunnIndex` (#2049)
  - `HomogeneityScore` (#2053)
  - `CompletenessScore` (#2053)
  - `VMeasureScore` (#2053)
  - `FowlkesMallowsIndex` (#2066)
  - `AdjustedMutualInfoScore` (#2058)
  - `DaviesBouldinScore` (#2071)
- Added `backend` argument to `MeanAveragePrecision` (#2034)
Full Changelog: v1.1.0...v1.2.0
New Contributors since v1.1.0
- @matsumotosan made their first contribution in #2008
- @GlavitsBalazs made their first contribution in #2042
- @OmerShubi made their first contribution in #2081
- @munahaf made their first contribution in #2082
Key Contributors
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Weekly patch release
[1.1.2] - 2023-09-11
Fixed
- Fixed tie breaking in ndcg metric (#2031)
- Fixed bug in `BootStrapper` when very few samples were evaluated, which could lead to a crash (#2052)
- Fixed bug when creating multiple plots that led to not all plots being shown (#2060)
- Fixed performance issues in `RecallAtFixedPrecision` for large batch sizes (#2042)
- Fixed bug related to `MetricCollection` used with custom metrics that have `prefix`/`postfix` attributes (#2070)
Contributors
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Weekly patch release
[1.1.1] - 2023-08-29
Added
- Added `average` argument to `MeanAveragePrecision` (#2018)
Fixed
- Fixed bug in `PearsonCorrCoef` when it is updated on single samples at a time (#2019)
- Fixed support for pixel-wise MSE (#2017)
- Fixed bug in `MetricCollection` when used with multiple metrics that return dicts with the same keys (#2027)
- Fixed bug in detection intersection metrics when `class_metrics=True` resulting in wrong values (#1924)
- Fixed missing attributes `higher_is_better`, `is_differentiable` for some metrics (#2028)
Contributors
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Into Generative AI
In version v1.1 of TorchMetrics, a total of five new metrics have been added, bringing the total number of metrics up to 128! In particular, we have two new exciting metrics for evaluating your favorite generative models for images.
Perceptual Path Length
Introduced in the famous StyleGAN paper back in 2018, the Perceptual Path Length metric is used to quantify how smoothly a generator manages to interpolate between points in its latent space.
Why does the smoothness of your generative model's latent space matter? Assume you find a point in your latent space that generates an image you like, but you would like to see if you could find an even better one by slightly changing the latent point it was generated from. If your latent space is not smooth, this becomes very hard, because even small changes to the latent point can lead to large changes in the generated image.
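To make the idea concrete, here is a rough, hand-rolled sketch of the computation; this is not the `PerceptualPathLength` API itself, and the tiny linear "generator" is only a placeholder to keep the snippet self-contained. We interpolate between pairs of latent points, generate images at two nearby interpolation positions, and measure the perceptual (LPIPS) distance between them, scaled by the squared step size as in the StyleGAN definition.

import torch
from torch import nn
from torchmetrics.image import LearnedPerceptualImagePatchSimilarity

# placeholder generator: maps a 16-dim latent to a 3x64x64 image in [0, 1]
generator = nn.Sequential(nn.Linear(16, 3 * 64 * 64), nn.Sigmoid())

def generate(z):
    return generator(z).reshape(-1, 3, 64, 64)

# LPIPS acts as the perceptual distance; normalize=True expects images in [0, 1]
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

z0, z1 = torch.randn(8, 16), torch.randn(8, 16)
t = torch.rand(8, 1)  # random positions along each interpolation path
eps = 1e-2            # small step along the path

with torch.no_grad():
    img_a = generate(torch.lerp(z0, z1, t))
    img_b = generate(torch.lerp(z0, z1, t + eps))
    # mean perceptual distance per squared step; smoother latent spaces give lower values
    path_length = lpips(img_a, img_b) / eps ** 2

The `PerceptualPathLength` metric in the image package packages this procedure up so that you only need to supply your generator.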
CLIP image quality assessment
CLIP image quality assessment (CLIPIQA) is a very recently proposed metric from this paper. The metric builds on the OpenAI CLIP model, which is a multi-modal model for connecting text and images. The core idea behind the metric is that different properties of an image can be assessed by measuring how similar the CLIP embedding of the image is to the respective CLIP embeddings of a positive and a negative prompt for that given property.
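A minimal sketch of how the metric is called (the default setup requires the extra CLIP dependencies and downloads the model weights on first use, so check the documentation for the exact requirements and the built-in prompt pairs; the random images here only illustrate the call):

import torch
from torchmetrics.multimodal import CLIPImageQualityAssessment

# a batch of (N, C, H, W) images; random values, so the scores are meaningless
imgs = torch.rand(2, 3, 224, 224)

# the default prompt pair targets overall "quality"; other properties can be
# assessed by passing different (positive, negative) prompt pairs via `prompts`
clip_iqa = CLIPImageQualityAssessment()
clip_iqa(imgs)
# one score per image, between 0 (negative prompt wins) and 1 (positive prompt wins)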
VIF, Edit, and SA-SDR
- `VisualInformationFidelity` has been added to the image package. First proposed in this paper, it can be used to automatically assess the quality of images in a perceptual manner.
- `EditDistance` has been added to the text package. It is a very classical text metric that simply measures the number of characters that need to be substituted, inserted, or deleted to transform the predicted text into the reference text (see the short example after this list).
- `SourceAggregatedSignalDistortionRatio` has been added to the audio package. The metric was originally proposed in this paper and is an improvement over the classical Signal-to-Distortion Ratio (SDR) metric (also found in TorchMetrics) that provides more stable gradients when training models for source separation.
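A quick example of the new text metric, assuming the default mean reduction over the sentence pairs:

from torchmetrics.text import EditDistance

preds = ["rain", "lnaguaeg"]
target = ["shine", "language"]

ed = EditDistance()
ed(preds, target)
# tensor(3.5000) -> on average 3.5 character edits per sentence pair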
[1.1.0] - 2023-08-22
Added
- Added source aggregated signal-to-distortion ratio (SA-SDR) metric (#1882)
- Added `VisualInformationFidelity` to image package (#1830)
- Added `EditDistance` to text package (#1906)
- Added `top_k` argument to `RetrievalMRR` in retrieval package (#1961)
- Added support for evaluating `"segm"` and `"bbox"` detection in `MeanAveragePrecision` at the same time (#1928)
- Added `PerceptualPathLength` to image package (#1939)
- Added support for multioutput evaluation in `MeanSquaredError` (#1937)
- Added argument `extended_summary` to `MeanAveragePrecision` such that precision, recall, iou can be easily returned (#1983)
- Added warning to `ClipScore` if long captions are detected and truncated (#2001)
- Added `CLIPImageQualityAssessment` to multimodal package (#1931)
- Added new property `metric_state` to all metrics for users to investigate currently stored tensors in memory (#2006)
Full Changelog: v1.0.0...v1.1.0
New Contributors since v1.0.0
- @fansuregrin made their first contribution in #1892
- @salcc made their first contribution in #1934
- @IanMaquignaz made their first contribution in #1943
- @kn made their first contribution in #1955
- @Vivswan made their first contribution in #1982
- @njuaplusplus made their first contribution in #1986
Contributors
@bojobo, @lucadiliello, @quancs, @SkafteNicki
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Weekly patch release
[1.0.3] - 2023-08-08
Added
- Added warning to `MeanAveragePrecision` if too many detections are observed (#1978)
Fixed
- Fixed support for int input when `multidim_average="samplewise"` in classification metrics (#1977)
- Fixed x/y labels when plotting confusion matrices (#1976)
- Fixed IoU compute on CUDA (#1982)
Contributors
@Borda, @SkafteNicki, @Vivswan
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Weekly patch release
[1.0.2] - 2023-08-03
Added
- Added warning to `PearsonCorrCoef` if input has a very small variance for its given dtype (#1926)
Changed
- Changed all non-task specific classification metrics to be true subtypes of `Metric` (#1963)
Fixed
- Fixed bug in `CalibrationError` where calculations for double precision input were performed in float precision (#1919)
- Fixed bug related to the `prefix`/`postfix` arguments in `MetricCollection` and `ClasswiseWrapper` being duplicated (#1918)
- Fixed missing AUC score when plotting classification metrics that support the `score` argument (#1948)
Contributors
If we forgot someone due to not matching commit email with GitHub account, let us know :]