Add filter for model type #3799

Merged
Samoed merged 10 commits into embeddings-benchmark:main from ayush1298:add_filter_for_model_type on Dec 30, 2025

Conversation

ayush1298 (Collaborator) commented Dec 26, 2025

closes #3051
closes #1841

model_type was added to model_meta in PR #3751

Copilot AI review requested due to automatic review settings, December 26, 2025 18:02
ayush1298 marked this pull request as draft on December 26, 2025 18:02
Copilot AI (Contributor) left a comment


Pull request overview

This PR adds functionality to filter models by their type (dense, cross-encoder, late-interaction) in the MTEB leaderboard and related APIs. The changes introduce a new model_types parameter across multiple files to enable filtering.

Key Changes:

  • Added model_types parameter to get_model_metas() function for model filtering
  • Added model_types parameter to filter_tasks() function for task filtering
  • Integrated a UI checkbox group in the leaderboard app to select model types
  • Updated all relevant event handlers to propagate the model_types filter

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Reviewed files:

  • mteb/models/get_model_meta.py: added the model_types parameter to get_model_metas() with filtering logic based on the model_meta.model_type field
  • mteb/leaderboard/app.py: added a UI checkbox group for model type selection and threaded the parameter through all filtering functions and event handlers
  • mteb/filter_tasks.py: added the model_types parameter to filter_tasks() with filtering logic based on the metadata.model_types field
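
A minimal usage sketch of the new parameter, mirroring the test snippet discussed below (illustrative only; the exact merged signature may differ):

import mteb

# Keep only cross-encoders; the other documented values are "dense" and "late-interaction".
cross_encoders = mteb.get_model_metas(model_types=["cross-encoder"])

for meta in cross_encoders:
    print(meta.name, meta.model_type)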

ayush1298 marked this pull request as ready for review on December 27, 2025 05:47
KennethEnevoldsen (Contributor) left a comment


You have added mo

ayush1298 (Collaborator, Author) commented Dec 27, 2025

Updated leaderboard:
[screenshot]

When applying the filter:

  • When Cross-encoder is selected in the MTEB(Multilingual, v2) benchmark: [screenshot]
  • When late-interaction is selected in the ViDoRe(v1, v2) benchmark: [screenshot]

Should we also add a model_type column to the Summary table?

Also, should we add some statistics, like the number of models once a model filter is applied (though one can see it directly from the summary table)?

Samoed (Member) commented Dec 27, 2025

Should we also add a model_type column to the Summary table?

I don't think so; there are already too many columns in the leaderboard.

Also, should we add some statistics, like the number of models once a model filter is applied (though one can see it directly from the summary table)?

I don't think we need this either.

ayush1298 (Collaborator, Author) commented:

Not related to this PR:
While running the leaderboard using make run-leaderboard, I am getting lots of warnings; should we remove some of them? I have listed a few below which I feel are not essential:

  1. https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/model_meta.py#L514:L516
  2. Missing subsets {'ru'} for split test: warnings like these just fill up the terminal. Can we combine all missing subsets/splits for each task and show them once? That would be less cluttered.
  3. https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/get_model_meta.py#L135-L136
     I couldn't work out why it tries to fetch these models; I think these are models for which we don't have an implementation in MTEB. The logs look like:
Could not get source model: "Model 'Qwen/Qwen3-0.6B' not found in MTEB registry. Did you mean: 'Qwen/Qwen3-Embedding-0.6B' or Qwen/Qwen3-Embedding-8B?" in MTEB

Could not get source model: "Model 'NeuML/pubmedbert-base-embeddings' not found in MTEB registry. Did you mean: 'NeuML/pubmedbert-base-embeddings-8M' or NeuML/pubmedbert-base-embeddings-2M?" in MTEB

ayush1298 (Collaborator, Author) commented Dec 27, 2025

Should we also add a model_type column to the Summary table?

I don't think so; there are already too many columns in the leaderboard.

Anyway, we have a scrollbar there. This was just so that one can easily see the model type right in the table.

KennethEnevoldsen (Contributor) commented:

@ayush1298 I would make a separate issue on the warnings (but it would be nice to clean them up a bit)

ayush1298 (Collaborator, Author) commented:

I have added tests

Comment on lines 130 to 133
@pytest.mark.parametrize("model_meta", mteb.get_model_metas())
def test_model_meta_has_model_type(model_meta: ModelMeta):
    """Test that all models have model_type field."""
    assert model_meta.model_type is not None
Contributor


this does not test the get_model_meta

ayush1298 (Collaborator, Author) commented Dec 27, 2025


Sorry, yes, it would just check that the field is filled in ModelMeta. Should I change it to the below?

@pytest.mark.parametrize("model_type", ["dense", "cross-encoder", "late-interaction"])
def test_get_model_metas_each_model_type(model_type):
    models = mteb.get_model_metas(model_types=[model_type])
    
    for model in models:
        assert model_type in model.model_type

Contributor


Yes

ayush1298 (Collaborator, Author) commented:

@KennethEnevoldsen I have added tests

Samoed (Member) commented Dec 29, 2025

Can you create a leaderboard that we can test?

Samoed added the leaderboard label (issues related to the leaderboard) on Dec 29, 2025
ayush1298 (Collaborator, Author) commented Dec 29, 2025

Can you create a leaderboard that we can test?

I had forgotten about this one. The first step was to duplicate the HF Space, but then how should I run it from this branch?

Samoed (Member) commented Dec 29, 2025

You can set it in the Dockerfile of your Space

ayush1298 (Collaborator, Author) commented:

Samoed (Member) commented Dec 29, 2025

I think if nothing is selected then there shouldn't be any models

[screenshot]

Samoed (Member) commented Dec 29, 2025

But overall this is working:
[screenshot]

ayush1298 (Collaborator, Author) commented Dec 29, 2025

I think if nothing is selected, then there shouldn't be any models

I'm not sure why the filter is not working correctly here. I don't think the filter logic is wrong, but I haven't been able to find out why this happens.
@Samoed, could you please check it?

ayush1298 closed this on Dec 29, 2025
ayush1298 reopened this on Dec 29, 2025
Samoed (Member) commented Dec 30, 2025

Yes, I'll check later

KennethEnevoldsen (Contributor) commented:

I think if nothing is selected, then there shouldn't be any models

I think it is fair to assume that if everything is unselected the filter is not applied (I would leave it as is)

Samoed (Member) commented Dec 30, 2025

I think it is fair to assume that if everything is unselected the filter is not applied (I would leave it as is)

I'm not sure. I fixed this in the last commit, eb75377.

KennethEnevoldsen (Contributor) commented:

I'm not sure. I fixed this in the last commit, eb75377.

So the default is all boxes checked?

Samoed (Member) commented Dec 30, 2025

Yes
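
A rough sketch of that default with a Gradio checkbox group (variable names are illustrative and need not match the actual app.py code):

import gradio as gr

MODEL_TYPES = ["dense", "cross-encoder", "late-interaction"]

# All boxes checked by default, so an untouched filter keeps every model visible.
model_type_select = gr.CheckboxGroup(
    choices=MODEL_TYPES,
    value=MODEL_TYPES,
    label="Model type",
)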

KennethEnevoldsen (Contributor) commented Dec 30, 2025

I don't have a strong opinion here, but we should be consistent with the other selectors. I see that we are currently inconsistent in our radio buttons:

[screenshot]

This is probably a separate issue, though.

ayush1298 (Collaborator, Author) commented:

Can we merge this one? @Samoed

Samoed merged commit 6f4627e into embeddings-benchmark:main on Dec 30, 2025
11 checks passed
ayush1298 deleted the add_filter_for_model_type branch on December 30, 2025 18:35
isaac-chung added a commit that referenced this pull request Jan 7, 2026

Labels

leaderboard (issues related to the leaderboard)


Development

Successfully merging this pull request may close these issues.

  • Leaderboard: add filters for other types of models (late interaction, sparse)
  • New leaderboard filtering cross-encoders

3 participants