
[WIP] Add RTEB retrieval code #2529

Closed
fzliu wants to merge 11 commits into embeddings-benchmark:main from fzliu:main

Conversation


@fzliu fzliu commented Apr 10, 2025

Related: #2517

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.


@Samoed Samoed left a comment


Please don't add the rteb repository here directly. You should integrate your datasets into mteb. We standardized loading for retrieval tasks in the v2 branch.
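For context, the standardized layout for retrieval tasks splits each dataset into a corpus, queries, and qrels (relevance judgments). The sketch below is illustrative only: the dict layout follows the common BEIR-style convention used by such tasks, and the helper name is hypothetical, not mteb's actual API.

```python
# Illustrative BEIR-style retrieval layout: corpus, queries, and qrels.
# Ids and texts below are made-up examples.
corpus = {
    "doc1": {"title": "Case A", "text": "Full text of the first case document."},
    "doc2": {"title": "Case B", "text": "Full text of the second case document."},
}
queries = {
    "q1": "Which case discusses topic X?",
}
# qrels map a query id to {doc id: relevance grade}
qrels = {
    "q1": {"doc1": 1},
}

def relevant_docs(query_id: str) -> set[str]:
    """Return the ids of documents judged relevant (grade > 0) for a query."""
    return {d for d, grade in qrels.get(query_id, {}).items() if grade > 0}
```

With this shape, any loader that yields the three dicts plugs into the same evaluation path, which is the point of standardizing it.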

@Samoed Samoed changed the title [WIP] Add retrieval code [WIP] Add RTEB retrieval code Apr 10, 2025

@KennethEnevoldsen KennethEnevoldsen left a comment


Yea, so I generally agree with @Samoed here, but let us take it one element at a time.

  1. Datasets: Some of the datasets are already integrated in MTEB, and we definitely don't want duplicates. I must say I am unsure why we can't add these as regular MTEB Retrieval tasks; is there anything I am missing?

  2. Models: All models are already implemented in MTEB. We don't want duplicate models.

These points were also raised in the issue.

@Samoed Samoed marked this pull request as draft April 13, 2025 11:37
* Create hook for RTEB evaluator inside MTEB repository

* Removing unused files and adding some changes

* Query/document types

* Refactoring (removing ebr, create rteb Retrieval folder)

* Separating the logic

* Using pl.LightningModule for the RTEB eval
input_type=None for the Voyage model (just like in RTEB)

* Storing rteb cached results in a separate folder

* Removing the Models (we'll use the MTEB models)
Add all the remaining RTEB Datasets - with TODOs

* Create new RTEB task type (AbsTaskRTEB)
Refactor all RTEB tasks

* Aggregated task

* Use HFDataLoader!

* Removing the rteb package

* Made all datasets working

* Correct voyageai model

* First simplifications

* Simplifications


fzliu commented May 1, 2025

Still WIP, but I think this is ready for a first pass.

@Samoed @KennethEnevoldsen to your earlier points, we changed the format of all the datasets to match those of existing retrieval datasets; however, the task is still separate since we're tracking multiple metrics and the format of the output results is a bit different.

We can merge it into the broader retrieval task down the road.
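As a point of reference, tracking multiple metrics per query does not by itself require a separate task type: standard retrieval metrics such as nDCG@k and recall@k can be computed side by side from the same ranked list and qrels. A minimal self-contained sketch (this is not MTEB's actual implementation, just the textbook definitions):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids, qrel, k=10):
    """nDCG@k for one query: ranked doc ids vs. a {doc_id: grade} judgment map."""
    gains = [qrel.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(qrel.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_ids, qrel, k=10):
    """Fraction of judged-relevant docs retrieved in the top k."""
    relevant = {d for d, g in qrel.items() if g > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)
```

Both functions consume the same ranking, so adding a metric is a matter of adding one entry to the per-task scores dict rather than changing the task abstraction.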


@Samoed Samoed left a comment


Your implementation is quite different from how other tasks are handled. I checked a few datasets and couldn't find zenodo/4063986 used in RTEBAILACasedocsTask, and lavita/ChatDoctor-HealthCareMagic-100k doesn't load with your current loader.

If you just need to add extra scores for Retrieval tasks to work correctly, you can use #2600 for that (this is on the v2 branch).

return round(model_memory_mb)

@property
def _id(self) -> str:


Why do you need this?

model_name="voyage-3-large", # Match the API model name
model_prompts=model_prompts,
),
max_tokens=32000, # Assuming same as voyage-3


Suggested change:
- max_tokens=32000, # Assuming same as voyage-3
+ max_tokens=32768,

model_prompts=model_prompts,
),
max_tokens=32000, # Assuming same as voyage-3
embed_dim=1024, # Assuming same as voyage-3


Can you check?

)
rteb_encoder._trainer = trainer

args = argparse.Namespace(


Why do you use argparse.Namespace?
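A typed dataclass is a common alternative to argparse.Namespace when no command-line parsing is actually happening: it documents the expected fields and catches typos. A minimal sketch with hypothetical field names (not the ones used in this PR):

```python
from dataclasses import dataclass

@dataclass
class EvalConfig:
    """Typed stand-in for an ad-hoc argparse.Namespace; fields are examples."""
    batch_size: int = 32
    max_length: int = 512
    device: str = "cpu"

# Construct with overrides just like a Namespace, but with type checking
# and editor support for the field names.
cfg = EvalConfig(batch_size=64)
```

Unlike a Namespace, accessing a misspelled field on a dataclass instance fails loudly, and defaults live in one declared place.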

"enable_checkpointing": False,
"enable_progress_bar": True,
}
trainer = pl.Trainer(**trainer_kwargs)


Why do you need Trainer if you're only encoding?
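For inference-only use, a plain batched loop avoids the Lightning Trainer machinery entirely. A minimal sketch, where `encode_batch` is a placeholder standing in for the real model call (not the Voyage API or any mteb function):

```python
# Encode a list of texts in fixed-size batches without a pl.Trainer.
def encode_batch(texts):
    # Placeholder: a real model would return one embedding vector per text.
    # Here we return a 1-d "vector" of the text length, just to show shapes.
    return [[float(len(t))] for t in texts]

def encode_all(texts, batch_size=2):
    """Accumulate embeddings batch by batch, preserving input order."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        embeddings.extend(encode_batch(texts[i : i + batch_size]))
    return embeddings
```

A Trainer adds value for training loops (checkpointing, distributed strategy, callbacks); for a forward-only encode pass, a loop like this keeps the dependency surface small.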

import warnings

warnings.warn(
"_evaluate_subset is deprecated for RTEB tasks. Use RTEBTaskRunner.run_rteb_evaluation instead.",


Why?

@KennethEnevoldsen

Hi @fzliu. This PR has currently introduced:

  1. Multiple tasks
  2. A new abstask (the core)
  3. A model
  4. An aggregated task (this is not needed; it should be a benchmark)

I would actually suggest closing this PR and opening two new ones:

  1. with the model
  2. with the RTEB abstask and one sample task

This allows us to focus on the main points. On those main points:

> the task is still separate since we're tracking multiple metrics

Which metrics are different, and why should we add them?

> and the format of the output results is a bit different.

I prefer keeping the current results format. This works with the current leaderboard, review pipeline, utility functions etc.

Changing the result format would require some justification.

@fzliu fzliu closed this May 6, 2025