Release MMTEB datasets on Hugging Face #2122

Open
NielsRogge opened this issue Feb 21, 2025 · 5 comments

Comments

@NielsRogge

Hi @KennethEnevoldsen 🤗

I'm Niels and work as part of the open-source team at Hugging Face. I discovered your work through Hugging Face's daily papers as yours got featured: https://huggingface.co/papers/2502.13595.
The paper page lets people discuss your paper and find its artifacts (your dataset, for instance).
You can also claim the paper as yours, which will show up on your public profile at HF.

Would you like to host the datasets you've released on https://huggingface.co/datasets?
Hosting on Hugging Face will give you more visibility/enable better discoverability, and will also allow people to do:

from datasets import load_dataset

dataset = load_dataset("your-hf-org-or-username/your-dataset")

If you're down, leaving a guide here: https://huggingface.co/docs/datasets/loading.
We also support Webdataset, useful for image/video datasets: https://huggingface.co/docs/datasets/en/loading#webdataset.

Besides that, there's the dataset viewer which allows people to quickly explore the first few rows of the data in the browser.

Let me know if you're interested/need any guidance.

Kind regards,

Niels

@Samoed
Collaborator

Samoed commented Feb 21, 2025

We have already uploaded a lot of datasets to our organization, https://huggingface.co/mteb, and all datasets that are included in MTEB come from Hugging Face.

@isaac-chung
Collaborator

Hi @NielsRogge !

Thanks for the message. Adding to what @Samoed pointed out, every task has a corresponding Hugging Face dataset name and revision. I wonder whether creating a collection would also be beneficial here. Ultimately, the datasets are grouped by benchmark in this repo's benchmarks.py file, so the collection might serve as a mirror of those.

We also have an ongoing task to move datasets under other users to the MTEB org on HF. (I'll have to find the link)
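
To illustrate the name-and-revision pairing mentioned above, here is a minimal sketch of how such a pair could be passed to `load_dataset`. The `{"path", "revision"}` mapping and the helper names `load_kwargs` and `load_pinned` are assumptions made for illustration, not MTEB's actual code; only `load_dataset` and its `path`, `revision`, and `split` parameters come from the `datasets` library.

```python
from typing import Optional


def load_kwargs(dataset_ref: dict) -> dict:
    """Build keyword arguments for datasets.load_dataset from a dataset reference.

    `dataset_ref` is a hypothetical mapping with a "path" key (the Hub dataset
    name) and an optional "revision" key (a branch, tag, or commit hash).
    """
    kwargs = {"path": dataset_ref["path"]}
    # Only pin a revision when one is given; otherwise the default branch is used.
    if dataset_ref.get("revision"):
        kwargs["revision"] = dataset_ref["revision"]
    return kwargs


def load_pinned(dataset_ref: dict, split: Optional[str] = None):
    """Load a dataset at the pinned revision (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so load_kwargs works without it

    return load_dataset(**load_kwargs(dataset_ref), split=split)
```

Calling `load_pinned({"path": "your-hf-org-or-username/your-dataset", "revision": "main"})` would then fetch that placeholder dataset from its `main` branch; pinning a commit hash instead keeps results reproducible if the dataset changes later.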

@NielsRogge
Author

Yes, that could be great! I see you have https://huggingface.co/collections/mteb/mmteb-67b74a586236bc839971e8cd, but it does not include the datasets so far.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Feb 21, 2025

Thanks for reaching out @NielsRogge 🤗 - love the work on datasets, and I think I can speak for everyone when I say that it makes the dataset aspect of MTEB a lot easier. E.g. I don't think any of the datasets are not on HF; the ease of access is just a very nice thing to have.

We could indeed start moving datasets into a collection - is there a way to do a bulk import for collections, or to automate it? Then we could do it for all the benchmarks in MTEB.

Hmm, so MMTEB contains multiple sets of datasets. I would probably create a collection for each and then refer to those collections in the MMTEB collection. However, it currently does not seem that collections can be part of other collections.

@tomaarsen
Member

is there a way to do a bulk import for collections?

Yes, you can do just about everything programmatically via huggingface_hub: https://huggingface.co/docs/huggingface_hub/v0.29.1/en/guides/collections
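
As a rough sketch of what such a bulk import could look like, the snippet below creates a collection and adds a list of datasets to it using `HfApi.create_collection` and `HfApi.add_collection_item` from `huggingface_hub`. The helper name `build_dataset_collection`, the collection title, and the dataset IDs are placeholders chosen for illustration; a token with write access to the target namespace is assumed to be configured.

```python
from typing import Iterable, Optional


def build_dataset_collection(
    title: str,
    dataset_ids: Iterable[str],
    namespace: Optional[str] = None,
    api=None,
) -> str:
    """Create (or reuse) a collection and add every dataset ID to it.

    Returns the slug of the collection. `api` defaults to a fresh HfApi
    instance; passing one in makes the helper easy to test without network.
    """
    if api is None:
        from huggingface_hub import HfApi  # lazy import; needs a write token

        api = HfApi()

    # exists_ok=True returns the existing collection instead of raising an error.
    collection = api.create_collection(
        title=title, namespace=namespace, exists_ok=True
    )

    # dict.fromkeys drops duplicate IDs while keeping the original order.
    for dataset_id in dict.fromkeys(dataset_ids):
        api.add_collection_item(
            collection_slug=collection.slug,
            item_id=dataset_id,
            item_type="dataset",
            exists_ok=True,  # skip items that are already in the collection
        )
    return collection.slug
```

A call such as `build_dataset_collection("MMTEB datasets", ["your-hf-org-or-username/your-dataset"], namespace="mteb")` could then be run once per benchmark, with the dataset IDs gathered from wherever the benchmark's tasks are defined.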
