Audio statistics by Samoed · Pull Request #3833 · embeddings-benchmark/mteb

Samoed · 2026-01-03T13:22:10Z

I’ve started integrating audio statistics. For now, I’ve come up with this format. Do you have any suggestions?

class AudioStatistics(TypedDict):
    """Class for descriptive statistics for audio.

    Attributes:
        total_audio_seconds_length: Total length of all audio clips in seconds
        min_audio_seconds_length: Minimum length of audio clip in seconds
        average_audio_seconds_length: Average length of audio clip in seconds
        max_audio_seconds_length: Maximum length of audio clip in seconds
        unique_audios: Number of unique audio clips
        average_sampling_rate: Average sampling rate
        sampling_rates: Dict of unique sampling rates and their frequencies
    """

    total_audio_seconds_length: float

    min_audio_seconds_length: float
    average_audio_seconds_length: float
    max_audio_seconds_length: float

    unique_audios: int

    average_sampling_rate: float
    sampling_rates: dict[int, int]
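
To make the proposed format concrete, here is a minimal sketch of how these fields could be filled in. It is not the mteb implementation: the helper name `calculate_audio_statistics_sketch` is hypothetical, and each clip is assumed to be a dict with an `array` of samples and an integer `sampling_rate`.

```python
import hashlib
from array import array
from collections import Counter
from typing import TypedDict


class AudioStatistics(TypedDict):
    total_audio_seconds_length: float
    min_audio_seconds_length: float
    average_audio_seconds_length: float
    max_audio_seconds_length: float
    unique_audios: int
    average_sampling_rate: float
    sampling_rates: dict[int, int]


def calculate_audio_statistics_sketch(clips: list[dict]) -> AudioStatistics:
    """Hypothetical helper (an assumption, not mteb's code).

    Each clip is assumed to look like {"array": [...], "sampling_rate": int}.
    """
    if not clips:
        raise ValueError("need at least one clip")

    # Duration in seconds = number of frames / sampling rate in Hz.
    durations = [len(c["array"]) / c["sampling_rate"] for c in clips]
    rates = [c["sampling_rate"] for c in clips]

    # Hash the raw samples so that identical clips are counted once.
    digests = {
        hashlib.sha256(array("d", c["array"]).tobytes()).hexdigest()
        for c in clips
    }

    return AudioStatistics(
        total_audio_seconds_length=sum(durations),
        min_audio_seconds_length=min(durations),
        average_audio_seconds_length=sum(durations) / len(durations),
        max_audio_seconds_length=max(durations),
        unique_audios=len(digests),
        average_sampling_rate=sum(rates) / len(rates),
        # Map each sampling rate (Hz) to the number of clips that use it.
        sampling_rates=dict(Counter(rates)),
    )
```

Hashing only the samples means two clips with identical samples but different sampling rates count as one; whether that is desirable is a design choice this sketch does not settle.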

isaac-chung · 2026-01-03T13:38:36Z

When I see length, I think in seconds. I like the frames approach too, and I'd like it spelled out explicitly (num_frames or whatever). I'd like to see:

the max/min/total number of seconds
the unique set of sampling rates (specify unit)

Would love to hear other feedback as well while I read into it a bit more.
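
The frames-versus-seconds distinction comes down to one division by the sampling rate. A small sketch, where the function name `frames_to_seconds` and the `_hz` suffix are illustrative assumptions rather than names from the PR:

```python
def frames_to_seconds(num_frames: int, sampling_rate_hz: int) -> float:
    """Convert a frame count to a duration in seconds.

    The same number of frames lasts a different amount of time
    depending on the sampling rate, which is why the unit matters.
    """
    if sampling_rate_hz <= 0:
        raise ValueError("sampling rate must be positive")
    return num_frames / sampling_rate_hz


# 48,000 frames is 3 seconds at 16 kHz but only 1 second at 48 kHz.
three_seconds = frames_to_seconds(48_000, 16_000)
one_second = frames_to_seconds(48_000, 48_000)
```

Spelling the unit out in the field name (seconds, frames, Hz) avoids having to guess which of the two quantities a "length" refers to.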

Samoed · 2026-01-03T15:44:04Z

Added seconds and sampling rates

isaac-chung

Sorry for adding more. I revisited some papers, and maybe we should use the standard measure of audio dataset size.

mteb/types/statistics.py

mteb/types/_encoder_io.py

isaac-chung

Just wanted to align with HF notation, plus some questions.

mteb/abstasks/_statistics_calculation.py

mteb/types/statistics.py

isaac-chung · 2026-01-03T20:13:30Z

mteb/types/statistics.py

+    unique_audios: int
+
+    average_sampling_rate: float
+    sampling_rates: dict[int, int]


Could this just be a unique set of sampling rates? OK either way.

Suggested change

-    sampling_rates: dict[int, int]
+    sampling_rates: list[int]

What do you think?

I think it's better to keep the dict so that we show the full distribution of sampling rates. If this becomes a problem, we can easily change it to a list of ints.
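
The trade-off discussed here can be illustrated with a short sketch: the dict keeps per-rate counts, and the list of unique rates can always be derived from it, but not the other way around. Both function names below are hypothetical and not part of mteb.

```python
from collections import Counter


def sampling_rate_distribution(rates: list[int]) -> dict[int, int]:
    """Count how many clips use each sampling rate (in Hz)."""
    return dict(Counter(rates))


def unique_sampling_rates(distribution: dict[int, int]) -> list[int]:
    """Derive the sorted list of unique rates from the distribution."""
    return sorted(distribution)


rates = [16_000, 16_000, 44_100, 16_000, 48_000]
distribution = sampling_rate_distribution(rates)
unique = unique_sampling_rates(distribution)
```

Going from `dict[int, int]` to `list[int]` only discards the counts, so switching later stays cheap if the dict turns out to be too verbose.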

KennethEnevoldsen

Minor things - generally think this looks good (of course Isaac's comments still apply, but nothing more to add)

pyproject.toml

mteb/abstasks/_statistics_calculation.py

Samoed · 2026-01-08T19:10:59Z

@isaac-chung Can you review this PR?

isaac-chung · 2026-01-08T20:14:23Z

uv.lock

Judging from the size, does this incorporate changes from #3875 too?

Yes, I used these changes here too

init

c2faf79

Samoed requested review from AdnanElAssadi56, KennethEnevoldsen and isaac-chung January 3, 2026 13:22

Samoed added the audio Audio extension label Jan 3, 2026

update statistics

b801a17

isaac-chung reviewed Jan 3, 2026

mteb/types/statistics.py (outdated, resolved)

mteb/types/_encoder_io.py (outdated, resolved)

update statistics

972497c

isaac-chung reviewed Jan 3, 2026

Myahr208 mentioned this pull request Jan 3, 2026

<Ai/help/gemini> #3839

Closed

Update mteb/types/statistics.py

ff5e93a

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

KennethEnevoldsen approved these changes Jan 4, 2026

pyproject.toml (outdated, resolved)

mteb/abstasks/_statistics_calculation.py (outdated, resolved)

Samoed and others added 3 commits January 4, 2026 16:46

Apply suggestions from code review

1f3d3ff

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

merge

9d4937a

update statistics

1f73b0e

Samoed requested review from KennethEnevoldsen and isaac-chung January 7, 2026 13:59

Samoed added 4 commits January 7, 2026 19:37

fix typehints

f05aec7

fix trust remote

0bd9774

Merge branch 'maeb' into audio_statistics

d9a0ad0

fix pyproject

d12af6f

Merge branch 'maeb' into audio_statistics

4fc6511

# Conflicts: # pyproject.toml # uv.lock

isaac-chung reviewed Jan 8, 2026

isaac-chung approved these changes Jan 8, 2026

Samoed merged commit 3d17dbc into maeb Jan 8, 2026
10 checks passed

Samoed deleted the audio_statistics branch January 8, 2026 20:18

5 participants