diff --git a/docs/usage/usage.md b/docs/usage/usage.md
index 06c5b9c1c3..a5b9319311 100644
--- a/docs/usage/usage.md
+++ b/docs/usage/usage.md
@@ -41,6 +41,20 @@
 results = evaluation.run(model)
 ```
 
+## Speeding up evaluations
+
+Evaluation in MTEB consists of three main components: downloading the dataset, encoding the samples, and the evaluation itself. Typically, the most notable bottleneck is either the download step or the encoding step. We discuss how to speed each of these up in the following sections.
+
+### Speeding up downloads
+
+The fastest way to speed up downloads is to use Hugging Face's [`xet`](https://huggingface.co/blog/xet-on-the-hub). You can enable it simply by installing:
+
+```bash
+pip install mteb[xet]
+```
+
+For one of the larger datasets, `MrTidyRetrieval` (~15 GB), we have seen download times drop from ~40 minutes to ~30 minutes when using `xet`.
+
 ### Evaluating on Different Modalities
 
 MTEB is not only text evaluating, but also allow you to evaluate image and image-text embeddings.
diff --git a/pyproject.toml b/pyproject.toml
index 113bf1a609..8e6063853d 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -94,6 +94,7 @@
 vertexai = ["vertexai==1.71.1"]
 llm2vec = ["llm2vec>=0.2.3,<0.3.0"]
 timm = ["timm>=1.0.15,<1.1.0"]
 open_clip_torch = ["open_clip_torch==2.31.0"]
+xet = ["huggingface_hub>=0.32.0"]
 ark = ["volcengine-python-sdk[ark]==3.0.2", "tiktoken>=0.8.0"]
 colpali_engine = ["colpali_engine>=0.3.10"]