Common voice #2951

Merged
isaac-chung merged 18 commits into embeddings-benchmark:maeb from hepengfe:common_voice
Aug 2, 2025

Member

@hepengfe hepengfe commented Jul 26, 2025

This PR fixes #2050

  • I have outlined why this dataset is filling an existing gap in mteb
  • I have tested that the dataset runs with the mteb package.

An easy way to test it is using:

import mteb
# sample model:
model = mteb.get_model("laion/clap-htsat-unfused")

task = mteb.get_task("CommonVoiceT2ARetrieval")
evaluation = mteb.MTEB(tasks=[task])
evaluation.run(model)
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • laion/clap-htsat-unfused
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks). I set the evaluation language to the one with a very small dataset.
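The size check in the last item can be sketched as a reproducible subsampling step. This is a minimal illustration, not the code used in this PR: the helper name `subsample`, the seed, and the choice to preserve the original ordering are all assumptions, and mteb may provide its own utilities for this.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def subsample(examples: Sequence[T], max_examples: int = 2048, seed: int = 42) -> list[T]:
    """Return at most max_examples items, chosen reproducibly.

    Hypothetical helper: the name, default seed, and ordering policy are
    assumptions made for this sketch.
    """
    if len(examples) <= max_examples:
        # Small datasets are returned unchanged.
        return list(examples)
    rng = random.Random(seed)
    # Sample indices without replacement, then sort them so the
    # original ordering of the dataset is preserved.
    indices = sorted(rng.sample(range(len(examples)), max_examples))
    return [examples[i] for i in indices]


if __name__ == "__main__":
    reduced = subsample(list(range(5000)))
    print(len(reduced))
```

A fixed seed keeps the evaluation subset identical across runs, which matters when results from different models are compared.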

@hepengfe hepengfe marked this pull request as ready for review July 26, 2025 18:38
Collaborator

@isaac-chung isaac-chung left a comment

Great start! Looks like you still need to run linting and get tests passing. If you're able to run the task, please attach the results in a comment.

Collaborator

Let's specify the version in the name to be clear. Also, this should inherit from MultilingualTask.

Suggested change
- class CommonVoiceA2TRetrieval(AbsTaskAny2AnyRetrieval):
+ from mteb.abstasks.MultilingualTask import MultilingualTask
+ class CommonVoice17A2TRetrieval(AbsTaskAny2AnyRetrieval, MultilingualTask):

Member Author

Renamed.

Collaborator

Suggested change
- name="CommonVoiceA2TRetrieval",
+ name="CommonVoice17A2TRetrieval",

Member Author

Renamed.

Collaborator

Please include all languages.

Member Author

Included.

@hepengfe
Member Author

@isaac-chung Here are the evaluation results for Afrikaans ("af") A2T retrieval.

{
  "dataset_revision": "b10d53980ef166bc24ce3358471c1970d7e6b5ec",
  "task_name": "CommonVoiceA2TRetrieval",
  "mteb_version": "1.21.3",
  "scores": {
    "test": [
      {
        "ndcg_at_1": 0.01613,
        "ndcg_at_3": 0.02631,
        "ndcg_at_5": 0.04573,
        "ndcg_at_10": 0.07229,
        "ndcg_at_20": 0.12053,
        "ndcg_at_100": 0.24477,
        "ndcg_at_1000": 0.24477,
        "map_at_1": 0.01613,
        "map_at_3": 0.02419,
        "map_at_5": 0.03468,
        "map_at_10": 0.04597,
        "map_at_20": 0.05881,
        "map_at_100": 0.07739,
        "map_at_1000": 0.07739,
        "recall_at_1": 0.01613,
        "recall_at_3": 0.03226,
        "recall_at_5": 0.08065,
        "recall_at_10": 0.16129,
        "recall_at_20": 0.35484,
        "recall_at_100": 1.0,
        "recall_at_1000": 1.0,
        "cv_recall_at_1": 0.01613,
        "cv_recall_at_3": 0.03226,
        "cv_recall_at_5": 0.08065,
        "cv_recall_at_10": 0.16129,
        "cv_recall_at_20": 0.35484,
        "cv_recall_at_100": 1.0,
        "cv_recall_at_1000": 1.0,
        "precision_at_1": 0.01613,
        "precision_at_3": 0.01075,
        "precision_at_5": 0.01613,
        "precision_at_10": 0.01613,
        "precision_at_20": 0.01774,
        "precision_at_100": 0.01,
        "precision_at_1000": 0.001,
        "mrr_at_1": 0.016129,
        "mrr_at_3": 0.024194,
        "mrr_at_5": 0.034677,
        "mrr_at_10": 0.045968,
        "mrr_at_20": 0.058809,
        "mrr_at_100": 0.077391,
        "mrr_at_1000": 0.077391,
        "nauc_ndcg_at_1_max": 0.312997,
        "nauc_ndcg_at_1_std": 0.312997,
        "nauc_ndcg_at_1_diff1": 0.096048,
        "nauc_ndcg_at_3_max": 0.001772,
        "nauc_ndcg_at_3_std": 0.001772,
        "nauc_ndcg_at_3_diff1": 0.096048,
        "nauc_ndcg_at_5_max": 0.14011,
        "nauc_ndcg_at_5_std": 0.056122,
        "nauc_ndcg_at_5_diff1": 0.133503,
        "nauc_ndcg_at_10_max": 0.14321,
        "nauc_ndcg_at_10_std": 0.106188,
        "nauc_ndcg_at_10_diff1": 0.13122,
        "nauc_ndcg_at_20_max": 0.106183,
        "nauc_ndcg_at_20_std": 0.026165,
        "nauc_ndcg_at_20_diff1": 0.084842,
        "nauc_ndcg_at_100_max": 0.088356,
        "nauc_ndcg_at_100_std": 0.054598,
        "nauc_ndcg_at_100_diff1": 0.066126,
        "nauc_ndcg_at_1000_max": 0.088356,
        "nauc_ndcg_at_1000_std": 0.054598,
        "nauc_ndcg_at_1000_diff1": 0.066126,
        "nauc_map_at_1_max": 0.312997,
        "nauc_map_at_1_std": 0.312997,
        "nauc_map_at_1_diff1": 0.096048,
        "nauc_map_at_3_max": 0.044828,
        "nauc_map_at_3_std": 0.044828,
        "nauc_map_at_3_diff1": 0.096048,
        "nauc_map_at_5_max": 0.12523,
        "nauc_map_at_5_std": 0.064861,
        "nauc_map_at_5_diff1": 0.117796,
        "nauc_map_at_10_max": 0.110845,
        "nauc_map_at_10_std": 0.091807,
        "nauc_map_at_10_diff1": 0.107052,
        "nauc_map_at_20_max": 0.09885,
        "nauc_map_at_20_std": 0.056171,
        "nauc_map_at_20_diff1": 0.089019,
        "nauc_map_at_100_max": 0.091044,
        "nauc_map_at_100_std": 0.064294,
        "nauc_map_at_100_diff1": 0.075931,
        "nauc_map_at_1000_max": 0.091044,
        "nauc_map_at_1000_std": 0.064294,
        "nauc_map_at_1000_diff1": 0.075931,
        "nauc_recall_at_1_max": 0.312997,
        "nauc_recall_at_1_std": 0.312997,
        "nauc_recall_at_1_diff1": 0.096048,
        "nauc_recall_at_3_max": -0.089256,
        "nauc_recall_at_3_std": -0.089256,
        "nauc_recall_at_3_diff1": 0.096048,
        "nauc_recall_at_5_max": 0.169741,
        "nauc_recall_at_5_std": 0.052166,
        "nauc_recall_at_5_diff1": 0.157699,
        "nauc_recall_at_10_max": 0.193538,
        "nauc_recall_at_10_std": 0.13094,
        "nauc_recall_at_10_diff1": 0.164648,
        "nauc_recall_at_20_max": 0.114459,
        "nauc_recall_at_20_std": -0.005367,
        "nauc_recall_at_20_diff1": 0.077691,
        "nauc_recall_at_100_max": NaN,
        "nauc_recall_at_100_std": NaN,
        "nauc_recall_at_100_diff1": NaN,
        "nauc_recall_at_1000_max": NaN,
        "nauc_recall_at_1000_std": NaN,
        "nauc_recall_at_1000_diff1": NaN,
        "nauc_precision_at_1_max": 0.312997,
        "nauc_precision_at_1_std": 0.312997,
        "nauc_precision_at_1_diff1": 0.096048,
        "nauc_precision_at_3_max": -0.089256,
        "nauc_precision_at_3_std": -0.089256,
        "nauc_precision_at_3_diff1": 0.096048,
        "nauc_precision_at_5_max": 0.169741,
        "nauc_precision_at_5_std": 0.052166,
        "nauc_precision_at_5_diff1": 0.157699,
        "nauc_precision_at_10_max": 0.193538,
        "nauc_precision_at_10_std": 0.13094,
        "nauc_precision_at_10_diff1": 0.164648,
        "nauc_precision_at_20_max": 0.114459,
        "nauc_precision_at_20_std": -0.005367,
        "nauc_precision_at_20_diff1": 0.077691,
        "nauc_precision_at_100_max": 1.0,
        "nauc_precision_at_100_std": 1.0,
        "nauc_precision_at_100_diff1": 1.0,
        "nauc_precision_at_1000_max": NaN,
        "nauc_precision_at_1000_std": NaN,
        "nauc_precision_at_1000_diff1": NaN,
        "nauc_cv_recall_at_1_max": 0.312997,
        "nauc_cv_recall_at_1_std": 0.312997,
        "nauc_cv_recall_at_1_diff1": 0.096048,
        "nauc_cv_recall_at_3_max": -0.089256,
        "nauc_cv_recall_at_3_std": -0.089256,
        "nauc_cv_recall_at_3_diff1": 0.096048,
        "nauc_cv_recall_at_5_max": 0.169741,
        "nauc_cv_recall_at_5_std": 0.052166,
        "nauc_cv_recall_at_5_diff1": 0.157699,
        "nauc_cv_recall_at_10_max": 0.193538,
        "nauc_cv_recall_at_10_std": 0.13094,
        "nauc_cv_recall_at_10_diff1": 0.164648,
        "nauc_cv_recall_at_20_max": 0.114459,
        "nauc_cv_recall_at_20_std": -0.005367,
        "nauc_cv_recall_at_20_diff1": 0.077691,
        "nauc_cv_recall_at_100_max": NaN,
        "nauc_cv_recall_at_100_std": NaN,
        "nauc_cv_recall_at_100_diff1": NaN,
        "nauc_cv_recall_at_1000_max": NaN,
        "nauc_cv_recall_at_1000_std": NaN,
        "nauc_cv_recall_at_1000_diff1": NaN,
        "nauc_mrr_at_1_max": 0.312997,
        "nauc_mrr_at_1_std": 0.312997,
        "nauc_mrr_at_1_diff1": 0.096048,
        "nauc_mrr_at_3_max": 0.044828,
        "nauc_mrr_at_3_std": 0.044828,
        "nauc_mrr_at_3_diff1": 0.096048,
        "nauc_mrr_at_5_max": 0.12523,
        "nauc_mrr_at_5_std": 0.064861,
        "nauc_mrr_at_5_diff1": 0.117796,
        "nauc_mrr_at_10_max": 0.110845,
        "nauc_mrr_at_10_std": 0.091807,
        "nauc_mrr_at_10_diff1": 0.107052,
        "nauc_mrr_at_20_max": 0.09885,
        "nauc_mrr_at_20_std": 0.056171,
        "nauc_mrr_at_20_diff1": 0.089019,
        "nauc_mrr_at_100_max": 0.091044,
        "nauc_mrr_at_100_std": 0.064294,
        "nauc_mrr_at_100_diff1": 0.075931,
        "nauc_mrr_at_1000_max": 0.091044,
        "nauc_mrr_at_1000_std": 0.064294,
        "nauc_mrr_at_1000_diff1": 0.075931,
        "main_score": 0.08065,
        "hf_subset": "default",
        "languages": [
          "af"
        ]
      }
    ]
  },
  "evaluation_time": 25.977766513824463,
  "kg_co2_emissions": null
}
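One hedged way to read these numbers against the "neither trivial nor random" checklist item is to compare the reported recall with what a uniformly random ranking would give. With a single relevant item per query, that baseline is k / N for a corpus of N candidates. The corpus size used below is an assumption: the recall values are multiples of roughly 1/62, which suggests 62 queries, and recall_at_100 reaches 1.0, but the PR does not state the corpus size. Substitute the real value before drawing conclusions.

```python
def random_recall_at_k(k: int, corpus_size: int) -> float:
    """Expected recall@k of a uniformly random ranking.

    Assumes exactly one relevant item per query.
    """
    return min(k, corpus_size) / corpus_size


def compare(reported: dict[int, float], corpus_size: int) -> dict[int, tuple[float, float]]:
    """Pair each reported recall@k with the random baseline, rounded to 5 places."""
    return {
        k: (value, round(random_recall_at_k(k, corpus_size), 5))
        for k, value in reported.items()
    }


# recall_at_k values copied from the A2T result file above.
REPORTED_A2T = {1: 0.01613, 3: 0.03226, 5: 0.08065, 10: 0.16129, 20: 0.35484}

# ASSUMPTION: 62 candidates in the corpus. This is inferred, not stated in the PR.
ASSUMED_CORPUS_SIZE = 62

if __name__ == "__main__":
    for k, (got, baseline) in compare(REPORTED_A2T, ASSUMED_CORPUS_SIZE).items():
        print(f"recall@{k}: reported={got:.5f} random={baseline:.5f}")
```

The comparison is only as good as the assumed corpus size, so it is a sanity check rather than a verdict on the model.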

@hepengfe
Member Author

hepengfe commented Jul 29, 2025

Here are the evaluation results for Afrikaans ("af") T2A retrieval.

{
  "dataset_revision": "b10d53980ef166bc24ce3358471c1970d7e6b5ec",
  "task_name": "CommonVoiceT2ARetrieval",
  "mteb_version": "1.21.3",
  "scores": {
    "test": [
      {
        "ndcg_at_1": 0.06452,
        "ndcg_at_3": 0.07258,
        "ndcg_at_5": 0.09201,
        "ndcg_at_10": 0.11711,
        "ndcg_at_20": 0.14548,
        "ndcg_at_100": 0.278,
        "ndcg_at_1000": 0.278,
        "map_at_1": 0.06452,
        "map_at_3": 0.06989,
        "map_at_5": 0.08038,
        "map_at_10": 0.09021,
        "map_at_20": 0.09787,
        "map_at_100": 0.11847,
        "map_at_1000": 0.11847,
        "recall_at_1": 0.06452,
        "recall_at_3": 0.08065,
        "recall_at_5": 0.12903,
        "recall_at_10": 0.20968,
        "recall_at_20": 0.32258,
        "recall_at_100": 1.0,
        "recall_at_1000": 1.0,
        "cv_recall_at_1": 0.06452,
        "cv_recall_at_3": 0.08065,
        "cv_recall_at_5": 0.12903,
        "cv_recall_at_10": 0.20968,
        "cv_recall_at_20": 0.32258,
        "cv_recall_at_100": 1.0,
        "cv_recall_at_1000": 1.0,
        "precision_at_1": 0.06452,
        "precision_at_3": 0.02688,
        "precision_at_5": 0.02581,
        "precision_at_10": 0.02097,
        "precision_at_20": 0.01613,
        "precision_at_100": 0.01,
        "precision_at_1000": 0.001,
        "mrr_at_1": 0.064516,
        "mrr_at_3": 0.069892,
        "mrr_at_5": 0.080376,
        "mrr_at_10": 0.090207,
        "mrr_at_20": 0.097866,
        "mrr_at_100": 0.118466,
        "mrr_at_1000": 0.118466,
        "nauc_ndcg_at_1_max": -0.056226,
        "nauc_ndcg_at_1_std": -0.150482,
        "nauc_ndcg_at_1_diff1": -0.089256,
        "nauc_ndcg_at_3_max": -0.09543,
        "nauc_ndcg_at_3_std": -0.179213,
        "nauc_ndcg_at_3_diff1": -0.101761,
        "nauc_ndcg_at_5_max": 0.023788,
        "nauc_ndcg_at_5_std": -0.03952,
        "nauc_ndcg_at_5_diff1": 0.008827,
        "nauc_ndcg_at_10_max": 0.065878,
        "nauc_ndcg_at_10_std": 0.063517,
        "nauc_ndcg_at_10_diff1": 0.003433,
        "nauc_ndcg_at_20_max": -0.023487,
        "nauc_ndcg_at_20_std": 0.082424,
        "nauc_ndcg_at_20_diff1": -0.009566,
        "nauc_ndcg_at_100_max": -0.022806,
        "nauc_ndcg_at_100_std": -0.009379,
        "nauc_ndcg_at_100_diff1": -0.024265,
        "nauc_ndcg_at_1000_max": -0.022806,
        "nauc_ndcg_at_1000_std": -0.009379,
        "nauc_ndcg_at_1000_diff1": -0.024265,
        "nauc_map_at_1_max": -0.056226,
        "nauc_map_at_1_std": -0.150482,
        "nauc_map_at_1_diff1": -0.089256,
        "nauc_map_at_3_max": -0.083367,
        "nauc_map_at_3_std": -0.170373,
        "nauc_map_at_3_diff1": -0.097913,
        "nauc_map_at_5_max": -0.013166,
        "nauc_map_at_5_std": -0.085539,
        "nauc_map_at_5_diff1": -0.035348,
        "nauc_map_at_10_max": 0.006031,
        "nauc_map_at_10_std": -0.024829,
        "nauc_map_at_10_diff1": -0.038193,
        "nauc_map_at_20_max": -0.024821,
        "nauc_map_at_20_std": -0.016865,
        "nauc_map_at_20_diff1": -0.04111,
        "nauc_map_at_100_max": -0.026922,
        "nauc_map_at_100_std": -0.040566,
        "nauc_map_at_100_diff1": -0.041791,
        "nauc_map_at_1000_max": -0.026922,
        "nauc_map_at_1000_std": -0.040566,
        "nauc_map_at_1000_diff1": -0.041791,
        "nauc_recall_at_1_max": -0.056226,
        "nauc_recall_at_1_std": -0.150482,
        "nauc_recall_at_1_diff1": -0.089256,
        "nauc_recall_at_3_max": -0.126794,
        "nauc_recall_at_3_std": -0.202199,
        "nauc_recall_at_3_diff1": -0.111765,
        "nauc_recall_at_5_max": 0.105986,
        "nauc_recall_at_5_std": 0.060673,
        "nauc_recall_at_5_diff1": 0.10727,
        "nauc_recall_at_10_max": 0.175779,
        "nauc_recall_at_10_std": 0.214556,
        "nauc_recall_at_10_diff1": 0.077389,
        "nauc_recall_at_20_max": -0.036852,
        "nauc_recall_at_20_std": 0.235453,
        "nauc_recall_at_20_diff1": 0.034642,
        "nauc_recall_at_100_max": NaN,
        "nauc_recall_at_100_std": NaN,
        "nauc_recall_at_100_diff1": NaN,
        "nauc_recall_at_1000_max": NaN,
        "nauc_recall_at_1000_std": NaN,
        "nauc_recall_at_1000_diff1": NaN,
        "nauc_precision_at_1_max": -0.056226,
        "nauc_precision_at_1_std": -0.150482,
        "nauc_precision_at_1_diff1": -0.089256,
        "nauc_precision_at_3_max": -0.126794,
        "nauc_precision_at_3_std": -0.202199,
        "nauc_precision_at_3_diff1": -0.111765,
        "nauc_precision_at_5_max": 0.105986,
        "nauc_precision_at_5_std": 0.060673,
        "nauc_precision_at_5_diff1": 0.10727,
        "nauc_precision_at_10_max": 0.175779,
        "nauc_precision_at_10_std": 0.214556,
        "nauc_precision_at_10_diff1": 0.077389,
        "nauc_precision_at_20_max": -0.036852,
        "nauc_precision_at_20_std": 0.235453,
        "nauc_precision_at_20_diff1": 0.034642,
        "nauc_precision_at_100_max": 1.0,
        "nauc_precision_at_100_std": 1.0,
        "nauc_precision_at_100_diff1": 1.0,
        "nauc_precision_at_1000_max": NaN,
        "nauc_precision_at_1000_std": NaN,
        "nauc_precision_at_1000_diff1": NaN,
        "nauc_cv_recall_at_1_max": -0.056226,
        "nauc_cv_recall_at_1_std": -0.150482,
        "nauc_cv_recall_at_1_diff1": -0.089256,
        "nauc_cv_recall_at_3_max": -0.126794,
        "nauc_cv_recall_at_3_std": -0.202199,
        "nauc_cv_recall_at_3_diff1": -0.111765,
        "nauc_cv_recall_at_5_max": 0.105986,
        "nauc_cv_recall_at_5_std": 0.060673,
        "nauc_cv_recall_at_5_diff1": 0.10727,
        "nauc_cv_recall_at_10_max": 0.175779,
        "nauc_cv_recall_at_10_std": 0.214556,
        "nauc_cv_recall_at_10_diff1": 0.077389,
        "nauc_cv_recall_at_20_max": -0.036852,
        "nauc_cv_recall_at_20_std": 0.235453,
        "nauc_cv_recall_at_20_diff1": 0.034642,
        "nauc_cv_recall_at_100_max": NaN,
        "nauc_cv_recall_at_100_std": NaN,
        "nauc_cv_recall_at_100_diff1": NaN,
        "nauc_cv_recall_at_1000_max": NaN,
        "nauc_cv_recall_at_1000_std": NaN,
        "nauc_cv_recall_at_1000_diff1": NaN,
        "nauc_mrr_at_1_max": -0.056226,
        "nauc_mrr_at_1_std": -0.150482,
        "nauc_mrr_at_1_diff1": -0.089256,
        "nauc_mrr_at_3_max": -0.083367,
        "nauc_mrr_at_3_std": -0.170373,
        "nauc_mrr_at_3_diff1": -0.097913,
        "nauc_mrr_at_5_max": -0.013166,
        "nauc_mrr_at_5_std": -0.085539,
        "nauc_mrr_at_5_diff1": -0.035348,
        "nauc_mrr_at_10_max": 0.006031,
        "nauc_mrr_at_10_std": -0.024829,
        "nauc_mrr_at_10_diff1": -0.038193,
        "nauc_mrr_at_20_max": -0.024821,
        "nauc_mrr_at_20_std": -0.016865,
        "nauc_mrr_at_20_diff1": -0.04111,
        "nauc_mrr_at_100_max": -0.026922,
        "nauc_mrr_at_100_std": -0.040566,
        "nauc_mrr_at_100_diff1": -0.041791,
        "nauc_mrr_at_1000_max": -0.026922,
        "nauc_mrr_at_1000_std": -0.040566,
        "nauc_mrr_at_1000_diff1": -0.041791,
        "main_score": 0.12903,
        "hf_subset": "default",
        "languages": [
          "af"
        ]
      }
    ]
  },
  "evaluation_time": 24.952935695648193,
  "kg_co2_emissions": null
}
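Both result files contain bare NaN values, which strict JSON does not allow. Python's standard json module accepts them by default, so the files can still be loaded there; other parsers may reject them. The sketch below parses a shortened copy of the T2A result and drops the non-finite metrics. The helper name `finite_metrics` is hypothetical and the fragment keeps only a few of the fields shown above.

```python
import json
import math

# Shortened copy of the T2A result above; most metrics are omitted.
RAW_RESULT = """
{
  "task_name": "CommonVoiceT2ARetrieval",
  "scores": {
    "test": [
      {
        "recall_at_5": 0.12903,
        "nauc_recall_at_100_max": NaN,
        "nauc_precision_at_100_max": 1.0,
        "main_score": 0.12903,
        "hf_subset": "default",
        "languages": ["af"]
      }
    ]
  }
}
"""


def finite_metrics(split_scores: dict) -> dict:
    """Keep numeric metrics with finite values; drop NaN and non-numeric fields.

    Hypothetical helper written for this sketch.
    """
    return {
        name: value
        for name, value in split_scores.items()
        if isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    }


# json.loads turns the bare NaN token into float("nan") by default.
result = json.loads(RAW_RESULT)
test_scores = result["scores"]["test"][0]
clean = finite_metrics(test_scores)

if __name__ == "__main__":
    print(sorted(clean))
```

Filtering before aggregation avoids NaN propagating into averages computed across tasks or languages.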

@hepengfe hepengfe requested a review from isaac-chung July 29, 2025 06:50
@isaac-chung
Collaborator

The MTEB version shows that you might not have the most up-to-date fixed CLAP model. But at least it runs 👍 Please make sure all tests pass before you request a review.

@isaac-chung
Collaborator

Tests are still failing. Could you check what is causing it?

Collaborator

@isaac-chung isaac-chung left a comment

Nice work!

@hepengfe
Member Author

hepengfe commented Aug 2, 2025

@isaac-chung Should I merge now?

@isaac-chung isaac-chung merged commit 54561ed into embeddings-benchmark:maeb Aug 2, 2025
8 checks passed