🚀 Feature
Add `InfoLM`.
Sources: "InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation" (Colombo et al., AAAI 2022)
Motivation
The recent NLG metrics are more often based on BERT (or related) embeddings. As such, I believe we should also start adding such metrics to TorchMetrics, with an extra dependency on `transformers` if a user wants to use any of these metrics. The `InfoLM` metric is from a family of untrained metrics (i.e., the model is not fine-tuned on any specific task), so it should be easier for us to begin with it. (Any opinion on this? @Borda :] )
Abstract:
Assessing the quality of natural language generation systems through human annotation is very expensive. Additionally, human annotation campaigns are time-consuming and include non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy of quality. In the last decade, many string-based metrics (e.g., BLEU) have been introduced. However, such metrics usually rely on exact matches and thus, do not robustly handle synonyms. In this paper, we introduce InfoLM, a family of untrained metrics that can be viewed as a string-based metric that addresses the aforementioned flaws thanks to a pre-trained masked language model. This family of metrics also makes use of information measures allowing the adaptation of InfoLM to various evaluation criteria. Using direct assessment, we demonstrate that InfoLM achieves statistically significant improvement and over 10 points of correlation gains in many configurations on both summarization and data2text generation.
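For context, a minimal, self-contained sketch of the computation the abstract describes: mask each position in turn, read the masked LM's distribution over the vocabulary, average the per-position distributions into one distribution per sentence, and compare candidate and reference with an information measure. The uniform averaging and the choice of KL divergence here are illustrative assumptions; the paper covers a whole family of measures.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


@torch.no_grad()
def sentence_distribution(text: str) -> torch.Tensor:
    """Average masked-LM vocabulary distributions over all token positions."""
    enc = tokenizer(text, return_tensors="pt")
    ids = enc["input_ids"]
    dists = []
    for i in range(1, ids.shape[1] - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        logits = model(input_ids=masked, attention_mask=enc["attention_mask"]).logits
        dists.append(torch.softmax(logits[0, i], dim=-1))
    return torch.stack(dists).mean(dim=0)


def infolm_kl(candidate: str, reference: str) -> float:
    """KL(reference || candidate) between the aggregated distributions."""
    p = sentence_distribution(reference)
    q = sentence_distribution(candidate)
    return torch.sum(p * (torch.log(p) - torch.log(q))).item()


print(infolm_kl("the cat sat on the mat", "a cat was sitting on the mat"))
```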