Releases: Lightning-AI/torchmetrics
Improve mAP performance
[0.7.1] - 2022-02-03
Changed
- Used `torch.bucketize` in calibration error when `torch>1.8` for faster computations (#769)
- Improve mAP performance (#742)
Fixed
- Fixed check for available modules (#772)
- Fixed Matthews correlation coefficient when the denominator is 0 (#781)
Contributors
@Borda, @ramonemiliani93, @SkafteNicki, @twsl
If we forgot someone due to not matching commit email with GitHub account, let us know :]
New NLP metrics and improved API
We are excited to announce that TorchMetrics v0.7 is now publicly available. This is a significant release: it includes several new metrics (mainly for NLP), naming and import changes, general improvements to the API, and some other great features. TorchMetrics now has more than 60 metrics, and the package is more user-friendly than ever.
NLP metrics - Text package
The text package has been part of TorchMetrics since v0.5. With the growing capability of language generation models, there is also a real need for reliable evaluation metrics. With several added metrics and a unified API, TorchMetrics makes using these metrics even easier. TorchMetrics v0.7 adds several machine translation metrics such as chrF, chrF++, Translation Edit Rate, and Extended Edit Distance. Furthermore, it also supports other metrics: Match Error Rate, Word Information Lost, Word Information Preserved, and SQuAD evaluation metrics. Last but not least, the ROUGE score can now be evaluated against multiple references.
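As a quick illustration of the multi-reference support, here is a minimal sketch (assuming the top-level `ROUGEScore` export and the v0.7 calling convention of one list of references per prediction; ROUGE additionally needs the `nltk` package installed):

```python
from torchmetrics import ROUGEScore

rouge = ROUGEScore()

preds = ["the cat sat on the mat"]
# several acceptable references for the single prediction above
target = [["a cat sat on the mat", "the cat was sitting on the mat"]]

# returns a dict of rouge1/rouge2/rougeL precision, recall and f-measure scores
print(rouge(preds, target))
```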
Argument unification
Importantly, all text metrics now expect inputs in `preds`, `target` order and use these explicit keyword argument names. Any different naming used before v0.7 is deprecated and will be completely removed in v0.8.
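For example, with the functional word error rate (a minimal sketch of the unified signature; the metric choice is only illustrative):

```python
from torchmetrics.functional import word_error_rate

preds = ["hello world"]
target = ["hello beautiful world"]

# v0.7 convention: predictions first, targets second, with exactly these keyword names
score = word_error_rate(preds=preds, target=target)
print(score)
```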
Import and naming changes
TorchMetrics v0.7 brings both minor and more extensive changes to how metrics are imported. The import changes take effect directly in v0.7, meaning that you will most likely need to change the import statement for some specific metrics. All naming changes follow our standard deprecation process, meaning that in v0.7 any renamed metric will still work but raise a warning asking you to use the new metric name. From v0.8, the old metric names will no longer be available.
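A small sketch of what the deprecation means in practice, using the word error rate metric as an example (the old name keeps working in v0.7 but warns; it disappears in v0.8):

```python
# v0.6 style (deprecated in v0.7, removed in v0.8):
# from torchmetrics import WER

# v0.7 style:
from torchmetrics import WordErrorRate

metric = WordErrorRate()
metric.update(preds=["hello world"], target=["hello there world"])
print(metric.compute())
```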
[0.7.0] - 2022-01-17
Added
- Added NLP metrics: chrF, chrF++, Translation Edit Rate, Extended Edit Distance, Match Error Rate, Word Information Lost, Word Information Preserved, and SQuAD evaluation metrics
- Added `MultiScaleSSIM` into image metrics (#679)
- Added Signal to Distortion Ratio (`SDR`) to audio package (#565)
- Added `MinMaxMetric` to wrappers (#556)
- Added `ignore_index` to retrieval metrics (#676)
- Added support for multi references in `ROUGEScore` (#680)
- Added a default VSCode devcontainer configuration (#621)
Changed
- Scalar metrics will now consistently have additional dimensions squeezed (#622)
- Metrics having third party dependencies removed from global import (#463)
- `BLEUScore` now expects untokenized input to stay consistent with all the other text metrics (#640)
- Arguments reordered for `TER`, `BLEUScore`, `SacreBLEUScore`, `CHRFScore`; the expected input order is now predictions first and target second (#696)
- Changed dtype of metric state from `torch.float` to `torch.long` in `ConfusionMatrix` to accommodate larger values (#715)
- Unified `preds`, `target` input argument naming across all text metrics (#723, #727): `bert`, `bleu`, `chrf`, `sacre_bleu`, `wip`, `wil`, `cer`, `ter`, `wer`, `mer`, `rouge`, `squad`
Deprecated
- Renamed IoU -> Jaccard Index (#662)
- Renamed text WER metric: (#714)
  - `functional.wer` -> `functional.word_error_rate`
  - `WER` -> `WordErrorRate`
- Renamed correlation coefficient classes: (#710)
  - `MatthewsCorrcoef` -> `MatthewsCorrCoef`
  - `PearsonCorrcoef` -> `PearsonCorrCoef`
  - `SpearmanCorrcoef` -> `SpearmanCorrCoef`
- Renamed audio STOI metric: (#753, #758)
  - `audio.STOI` -> `audio.ShortTimeObjectiveIntelligibility`
  - `functional.audio.stoi` -> `functional.audio.short_time_objective_intelligibility`
- Renamed audio PESQ metrics: (#751)
  - `functional.audio.pesq` -> `functional.audio.perceptual_evaluation_speech_quality`
  - `audio.PESQ` -> `audio.PerceptualEvaluationSpeechQuality`
- Renamed audio SDR metrics: (#711)
  - `functional.sdr` -> `functional.signal_distortion_ratio`
  - `functional.si_sdr` -> `functional.scale_invariant_signal_distortion_ratio`
  - `SDR` -> `SignalDistortionRatio`
  - `SI_SDR` -> `ScaleInvariantSignalDistortionRatio`
- Renamed audio SNR metrics: (#712)
  - `functional.snr` -> `functional.signal_noise_ratio`
  - `functional.si_snr` -> `functional.scale_invariant_signal_noise_ratio`
  - `SNR` -> `SignalNoiseRatio`
  - `SI_SNR` -> `ScaleInvariantSignalNoiseRatio`
- Renamed F-score metrics: (#731, #740)
  - `functional.f1` -> `functional.f1_score`
  - `F1` -> `F1Score`
  - `functional.fbeta` -> `functional.fbeta_score`
  - `FBeta` -> `FBetaScore`
- Renamed Hinge metric: (#734)
  - `functional.hinge` -> `functional.hinge_loss`
  - `Hinge` -> `HingeLoss`
- Renamed image PSNR metrics (#732)
  - `functional.psnr` -> `functional.peak_signal_noise_ratio`
  - `PSNR` -> `PeakSignalNoiseRatio`
- Renamed audio PIT metric: (#737)
  - `functional.pit` -> `functional.permutation_invariant_training`
  - `PIT` -> `PermutationInvariantTraining`
- Renamed image SSIM metric: (#747)
  - `functional.ssim` -> `functional.structural_similarity_index_measure`
  - `SSIM` -> `StructuralSimilarityIndexMeasure`
- Renamed detection `MAP` to `MeanAveragePrecision` metric (#754)
- Renamed Fidelity & LPIPS image metrics: (#752)
  - `image.FID` -> `image.FrechetInceptionDistance`
  - `image.KID` -> `image.KernelInceptionDistance`
  - `image.LPIPS` -> `image.LearnedPerceptualImagePatchSimilarity`
Removed
- Removed `embedding_similarity` metric (#638)
- Removed argument `concatenate_texts` from `wer` metric (#638)
- Removed arguments `newline_sep` and `decimal_places` from `rouge` metric (#638)
Fixed
- Fixed `MetricCollection` kwargs filtering when no `kwargs` are present in the update signature (#707)
Contributors
@ashutoshml, @Borda, @cuent, @Fariborzzz, @getgaurav2, @janhenriklambrechts, @justusschock, @karthikrangasai, @lucadiliello, @mahinlma, @mathemusician, @mona0809, @mrleu, @puhuk, @quancs, @SkafteNicki, @stancld, @twsl
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Fixing mAP on GPU
[0.6.2] - 2021-12-15
Fixed
- Fixed handling of `torch.sort`, which currently does not support the bool `dtype` on CUDA (#665)
- Fixed mAP so that it properly checks if ground truths are empty (#684)
- Fixed initialization of tensors to be on the correct device for the `MAP` metric (#673)
Contributors
@OlofHarrysson, @tkupek, @twsl
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Own mAP implementation
[0.6.1] - 2021-12-06
Changed
- Migrate MAP metrics from pycocotools to PyTorch (#632)
- Use `torch.topk` instead of `torch.argsort` in retrieval precision for speedup (#627)
Fixed
- Fix empty predictions in MAP metric (#594, #610, #624)
- Fix edge case of AUROC with `average=weighted` on GPU (#606)
- Fixed `forward` in compositional metrics (#645)
Contributors
@Callidior, @SkafteNicki, @tkupek, @twsl, @zuoxingdong
If we forgot someone due to not matching commit email with GitHub account, let us know :]
More metrics than ever
[0.6.0] - 2021-10-28
We are excited to announce that TorchMetrics v0.6 is now publicly available. TorchMetrics v0.6 does not focus on specific domains but adds a ton of new metrics across several of them, increasing the number of metrics in the repository to over 60! Not only has v0.6 added metrics within already covered domains, it also adds support for two new ones: pairwise metrics and detection.
https://devblog.pytorchlightning.ai/torchmetrics-v0-6-more-metrics-than-ever-e98c3983621e
Pairwise Metrics
TorchMetrics v0.6 offers a new set of metrics in its functional backend for calculating pairwise distances. Given a tensor `X` with shape `[N,d]` (`N` observations, each in `d` dimensions), a pairwise metric calculates the `[N,N]` matrix of all possible combinations between the rows of `X`.
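A minimal sketch of the functional pairwise API, using cosine similarity as the example metric (assuming the `torchmetrics.functional.pairwise_cosine_similarity` import from this release):

```python
import torch
from torchmetrics.functional import pairwise_cosine_similarity

x = torch.randn(4, 8)  # N=4 observations, each in d=8 dimensions

# [N, N] matrix of cosine similarities between all pairs of rows of x
sim = pairwise_cosine_similarity(x)
print(sim.shape)  # torch.Size([4, 4])
```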
Detection
TorchMetrics v0.6 now includes a detection package that provides the MAP metric. The implementation essentially wraps `pycocotools`, which ensures that we get correct values, but with the benefit of being able to scale to multiple devices (like any other metric in TorchMetrics).
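A minimal sketch of how the detection metric is fed (assuming the v0.6 class name `MAP` under `torchmetrics.detection` and the boxes/scores/labels dictionary format; in v0.6.0 this still requires `pycocotools` to be installed, and the class was later renamed `MeanAveragePrecision`):

```python
import torch
from torchmetrics.detection import MAP

metric = MAP()

# predictions for a single image: boxes in xyxy format plus confidence scores and class labels
preds = [
    dict(
        boxes=torch.tensor([[10.0, 10.0, 50.0, 50.0]]),
        scores=torch.tensor([0.9]),
        labels=torch.tensor([0]),
    )
]
# ground truth for the same image
target = [
    dict(
        boxes=torch.tensor([[12.0, 8.0, 52.0, 48.0]]),
        labels=torch.tensor([0]),
    )
]

metric.update(preds, target)
print(metric.compute())  # overall and per-threshold mAP values (map, map_50, map_75, ...)
```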
New additions
-
In the
audio
package, we have two new metrics: Perceptual Evaluation of Speech Quality (PESQ) and Short Term Objective Intelligibility (STOI). Both metrics can be used to assert speech quality. -
In the
retrieval
package, we also have two new metrics: R-precision and Hit-rate. R-precision corresponds to recall at the R-th position of the query. The hit rate is the ratio of the total number of hits returned as a result of a query (hits) to the total number of hits returned. -
The
text
package also receives an update in the form of two new metrics: Sacre BLEU score and character error rate. Sacre BLUE score provides and more systematic way of comparing BLUE scores across tasks. The character error rate is similar to the word error rate but instead calculates if a given algorithm has correctly predicted a sentence based on a character-by-character comparison. -
The
regression
package got a single new metric in the form of the Tweedie deviance score metric. Deviance scores are generally a better measure of fit than measures such as squared error when trying to model data coming from highly screwed distributions. -
Finally, we have added five new metrics for simple aggregation:
SumMetric
,MeanMetric
,MinMetric
,MaxMetric
,CatMetric
. All five metrics take in a single input (either native python floats ortorch.Tensor
) and keep track of the sum, average, min, etc. These new aggregation metrics are especially useful in combination with self.log from lightning if you want to log something other than the average of the metric you are tracking.
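A minimal sketch of the aggregation metrics, assuming the top-level `MeanMetric` export from this release:

```python
import torch
from torchmetrics import MeanMetric

avg_loss = MeanMetric()

# values can be native Python floats or tensors; state accumulates across updates
avg_loss.update(0.5)
avg_loss.update(torch.tensor([0.25, 0.75]))

print(avg_loss.compute())  # mean of all values seen so far -> tensor(0.5)
```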
Detail changes
Added
- Added audio metrics: Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI)
- Added Information retrieval metrics: R-precision and hit rate
- Added NLP metrics: SacreBLEU score and character error rate
- Added other metrics: Tweedie deviance score
- Added `MAP` (mean average precision) metric to new detection package (#467)
- Added support for float targets in `nDCG` metric (#437)
- Added `average` argument to `AveragePrecision` metric for reducing multi-label and multi-class problems (#477)
- Added `MultioutputWrapper` (#510)
- Added metric sweeping:
- Added simple aggregation metrics: `SumMetric`, `MeanMetric`, `CatMetric`, `MinMetric`, `MaxMetric` (#506)
- Added pairwise submodule with metrics (#553)
  - `pairwise_cosine_similarity`
  - `pairwise_euclidean_distance`
  - `pairwise_linear_similarity`
  - `pairwise_manhatten_distance`
Changed
- `AveragePrecision` will now by default output the `macro` average for multilabel and multiclass problems (#477)
- `half`, `double`, `float` will no longer change the dtype of the metric states. Use `metric.set_dtype` instead (#493)
- Renamed `AverageMeter` to `MeanMetric` (#506)
- Changed `is_differentiable` from property to a constant attribute (#551)
- `ROC` and `AUROC` will no longer throw an error when either the positive or negative class is missing. Instead, they return 0 scores and give a warning
Deprecated
- Deprecated `torchmetrics.functional.self_supervised.embedding_similarity` in favour of new pairwise submodule
Removed
- Removed `dtype` property (#493)
Fixed
- Fixed bug in `F1` with `average='macro'` and `ignore_index!=None` (#495)
- Fixed bug in `pit` by using the returned first result to initialize device and type (#533)
- Fixed `SSIM` metric using too much memory (#539)
- Fixed bug where `device` property was not properly updated when the metric was a child of a module (#542)
Contributors
@an1lam, @Borda, @karthikrangasai, @lucadiliello, @mahinlma, @obus, @quancs, @SkafteNicki, @stancld, @tkupek
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Own NLP implementations
[0.5.1] - 2021-08-30
Added
- Added `device` and `dtype` properties (#462)
- Added `TextTester` class for robustly testing text metrics (#450)
Changed
- Added support for float targets in `nDCG` metric (#437)
Removed
- Removed `rouge-score` as dependency for text package (#443)
- Removed `jiwer` as dependency for text package (#446)
- Removed `bert-score` as dependency for text package (#473)
Fixed
- Fixed ranking of samples in `SpearmanCorrCoef` metric (#448)
- Fixed bug where compositional metrics were unable to sync because of type mismatch (#454)
- Fixed metric hashing (#478)
- Fixed `BootStrapper` metrics not working on GPU (#462)
- Fixed the semantic ordering of kernel height and width in `SSIM` metric (#474)
Contributors
@justusschock, @karthikrangasai, @kingyiusuen, @obus, @SkafteNicki, @stancld
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Text-related (NLP) metrics
[0.5.0] - 2021-08-09
This release includes general improvements to the library and new metrics within the NLP domain.
https://devblog.pytorchlightning.ai/torchmetrics-v0-5-nlp-metrics-f4232467b0c5
Natural language processing is arguably one of the most exciting areas of machine learning, with models such as BERT, RoBERTa and GPT-3 pushing the boundaries of what automated text translation, recognition, and generation systems are capable of.
With the introduction of these models, many metrics have been proposed that measure how well these models perform. TorchMetrics v0.5 includes 4 such metrics: BERT score, BLEU, ROUGE and WER.
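A minimal sketch using the word error rate metric under the name it carried in this release (it was later renamed `WordErrorRate`); positional arguments are used because the keyword names changed in later versions, and depending on the exact version an extra text dependency such as `jiwer` may be required:

```python
from torchmetrics import WER

wer = WER()

predictions = ["hello world", "the cat sat on the mat"]
references = ["hello there world", "the cat sat on the mat"]

# fraction of word-level errors (substitutions, insertions, deletions) over all references
print(wer(predictions, references))
```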
Detail changes
Added
- Added Text-related (NLP) metrics: BERT score, BLEU, ROUGE, and WER
- Added `MetricTracker` wrapper metric for keeping track of the same metric over multiple epochs (#238)
- Added other metrics:
- Added support in `nDCG` metric for target with values larger than 1 (#349)
- Added support for negative targets in `nDCG` metric (#378)
- Added `None` as reduction option in `CosineSimilarity` metric (#400)
- Allowed passing labels in (n_samples, n_classes) to `AveragePrecision` (#386)
Changed
- Moved `psnr` and `ssim` from `functional.regression.*` to `functional.image.*` (#382)
- Moved `image_gradient` from `functional.image_gradients` to `functional.image.gradients` (#381)
- Moved `R2Score` from `regression.r2score` to `regression.r2` (#371)
- Pearson metric now only stores 6 statistics instead of all predictions and targets (#380)
- Use `torch.argmax` instead of `torch.topk` when `k=1` for better performance (#419)
- Moved check for number of samples in R2 score to support single sample updating (#426)
Deprecated
- Renamed `r2score` -> `r2_score` and `kldivergence` -> `kl_divergence` in `functional` (#371)
- Moved `bleu_score` from `functional.nlp` to `functional.text.bleu` (#360)
Removed
- Removed restriction that `threshold` has to be in (0,1) range to support logit input (#351, #401)
- Removed restriction that `preds` could not be bigger than `num_classes` to support logit input (#357)
- Removed module `regression.psnr` and `regression.ssim` (#382)
- Removed (#379):
  - function `functional.mean_relative_error`
  - `num_thresholds` argument in `BinnedPrecisionRecallCurve`
Fixed
- Fixed bug where classification metrics with `average='macro'` would lead to wrong result if a class was missing (#303)
- Fixed `weighted`, `multi-class` AUROC computation to allow for 0 observations of some class, as contribution to final AUROC is 0 (#376)
- Fixed that `_forward_cache` and `_computed` attributes are also moved to the correct device if metric is moved (#413)
- Fixed calculation in `IoU` metric when using `ignore_index` argument (#328)
Contributors
@BeyondTheProof, @Borda, @CSautier, @discort, @edwardclem, @gagan3012, @hugoperrin, @karthikrangasai, @paul-grundmann, @quancs, @rajs96, @SkafteNicki, @vatch123
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Fixing DDP sync
Multimedia - audio & image quality
Overview
https://devblog.pytorchlightning.ai/torchmetrics-v0-4-introducing-multimedia-metrics-e6380a3ad354
Audio
The first highlight of v0.4.0 is a set of 3 new metrics for evaluating audio data: scale-invariant signal-to-distortion ratio, scale-invariant signal-to-noise ratio, and signal-to-noise ratio. All these metrics take a predicted audio tensor and a target tensor, both with the shape `[..., time]`, and calculate the metric over the time axis.
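A minimal sketch using the scale-invariant SDR functional under the name it had in this release (`si_sdr`; it was renamed in v0.7), with random tensors standing in for real waveforms:

```python
import torch
from torchmetrics.functional import si_sdr

# predicted and target waveforms with shape [..., time]
preds = torch.randn(2, 16000)   # e.g. two one-second clips at 16 kHz
target = torch.randn(2, 16000)

score = si_sdr(preds, target)   # scale-invariant SDR in dB, one value per clip
print(score.shape)              # torch.Size([2])
```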
Image
Version v0.4.0 also includes a completely new image package. Since its initial 0.2.0 release, Torchmetrics has had both PSNR and SSIM in its regression module, metrics that can be used to evaluate image quality.
With the image module, we are adding three new metrics for evaluating the quality of generative models (such as GANs): Inception score (IS), Fréchet inception distance (FID) and kernel inception distance (KID).
More Functionality
In addition to the new audio and image package, we also want to highlight a couple of features:
- Addition of MeanAbsolutePercentageError (MAPE) metric to the regression package. Useful in regression settings where you want to focus on the relative instead of absolute error.
- Addition of KLDivergence metric to the classification package. Useful for measuring the distance between probability distributions like the ones outputted in variational auto-encoders.
- Addition of CosineSimilarity metric to the regression package. Useful for calculating the angle between two embedding vectors in domains such as metric learning.
- As requested by multiple users, Accuracy, Precision, Recall, FBeta, F1, StatScore, Hamming, ConfusionMatrix now directly support unnormalized predictions, e.g. logits from your model. No need to call `.softmax(dim=-1)` anymore! See the sketch after this list.
- All modular metrics now have both `sync` and `sync_context` methods that give the user full control over when metric states are synced. Note that we still automatically do this whenever calling the `compute` method.
- The `is_differentiable` property has been adopted by many more of our metrics!
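A minimal sketch of the logit support, using the standard `Accuracy` module (shapes and values below are illustrative):

```python
import torch
from torchmetrics import Accuracy

accuracy = Accuracy()

logits = torch.tensor([[2.0, -1.0, 0.5],    # unnormalized scores over 3 classes
                       [0.1,  3.0, -2.0]])
target = torch.tensor([0, 1])

# no .softmax(dim=-1) needed; unnormalized scores are handled directly
print(accuracy(logits, target))  # tensor(1.)
```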
Thanks
Big thanks to all community members for their contributions and feedback.
A special thanks to @quancs for leading the development of the new audio package.
[0.4.0] - 2021-06-24
Added
- Added Cosine Similarity metric (#305)
- Added Specificity metric (#210)
- Added `add_metrics` method to `MetricCollection` for adding additional metrics after initialization (#221)
- Added pre-gather reduction in the case of `dist_reduce_fx="cat"` to reduce communication cost (#217)
- Added better error message for `AUROC` when `num_classes` is not provided for multiclass input (#244)
- Added support for unnormalized scores (e.g. logits) in `Accuracy`, `Precision`, `Recall`, `FBeta`, `F1`, `StatScore`, `Hamming`, `ConfusionMatrix` metrics (#200)
- Added `MeanAbsolutePercentageError` (`MAPE`) metric (#248)
- Added `squared` argument to `MeanSquaredError` for computing `RMSE` (#249)
- Added FID metric (#213)
- Added `is_differentiable` property to `ConfusionMatrix`, `F1`, `FBeta`, `Hamming`, `Hinge`, `IOU`, `MatthewsCorrcoef`, `Precision`, `Recall`, `PrecisionRecallCurve`, `ROC`, `StatScores` (#253)
- Added audio metrics: SNR, SI_SDR, SI_SNR (#292)
- Added Inception Score metric to image module (#299)
- Added KID metric to image module (#301)
- Added `sync` and `sync_context` methods for manually controlling when metric states are synced (#302)
- Added `KLDivergence` metric (#247)
Changed
- Forward cache is reset when `reset` method is called (#260)
- Improved per-class metric handling for imbalanced datasets for `precision`, `recall`, `precision_recall`, `fbeta`, `f1`, `accuracy`, and `specificity` (#204)
- Decorated `torch.jit.unused` to `MetricCollection` forward (#307)
- Renamed `thresholds` argument to binned metrics for manually controlling the thresholds (#322)
Deprecated
- Deprecated `torchmetrics.functional.mean_relative_error` (#248)
- Deprecated `num_thresholds` argument in `BinnedPrecisionRecallCurve` (#322)
Removed
- Removed argument `is_multiclass` (#319)
Fixed
- AUC can also support more dimensional inputs when all but one dimension are of size 1 (#242)
- Fixed `dtype` of modular metrics after reset has been called (#243)
- Fixed calculation in `matthews_corrcoef` to correctly match formula (#321)
Contributors
@AnselmC, @arvindmuralie77, @bhadreshpsavani, @Borda, @GiannisVagionakis, @hassiahk, @IgorHoholko, @johannespitz, @justusschock, @maximsch2, @pranjaldatta, @quancs, @simran2905, @SkafteNicki, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release
[0.3.2] - 2021-05-10
Added
- Added `is_differentiable` property:
Changed
- `MetricCollection` should return metrics with prefix on `items()`, `keys()` (#209)
- Calling `compute` before `update` will now give a warning (#164)
Removed
- Removed `numpy` as dependency (#212)
Fixed
- Fixed auc calculation and add tests (#197)
- Fixed loading persisted metric states using `load_state_dict()` (#202)
- Fixed `PSNR` not working with `DDP` (#214)
- Fixed metric calculation with unequal batch sizes (#220)
- Fixed metric concatenation for list states for zero-dim input (#229)
- Fixed numerical instability in `AUROC` metric for large input (#230)
Contributors
@bhadreshpsavani, @hlin09, @maximsch2, @SkafteNicki, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]