feat: Add .from_hf_hub() and similar methods to ModelMeta#3737

Merged
Samoed merged 25 commits into main from refactor_meta
Dec 15, 2025

Samoed (Member) commented Dec 13, 2025

Close #3735
Close #3695
Close #3734

Move _model_meta_from_hf_hub, _model_meta_from_cross_encoder, and _model_meta_from_sentence_transformer into the ModelMeta class.

from sentence_transformers import SentenceTransformer

from mteb.models import ModelMeta

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="cpu")
meta = ModelMeta.from_sentence_transformer_model(model)
print(meta.to_dict())
# {'loader_kwargs': {}, 'name': 'Qwen/Qwen3-Embedding-0.6B', 'revision': 'c54f2e6e80b2d7b7de06f51cec4959f6b3e03418', 'release_date': None, 'languages': None, 'n_parameters': 595776512, 'memory_usage_mb': 1136, 'max_tokens': 32768, 'embed_dim': 1024, 'license': 'apache-2.0', 'open_weights': True, 'public_training_code': None, 'public_training_data': None, 'framework': ['Sentence Transformers'], 'reference': None, 'similarity_fn_name': <ScoringFunction.COSINE: 'cosine'>, 'use_instructions': None, 'training_datasets': None, 'adapted_from': None, 'superseded_by': None, 'modalities': ['text'], 'is_cross_encoder': None, 'citation': None, 'contacts': None, 'loader': 'sentence_transformers_loader'}

license=model_license,
framework=frameworks,
training_datasets=None,
similarity_fn_name=None,
Contributor

We can fetch similarity function from: https://huggingface.co/sentence-transformers/embeddinggemma-300m-medical/blob/main/config_sentence_transformers.json#L24

but if not I would be ok with assuming cosine? (I suspect it is also in the model data).

Feel free to make this a separate issue.

Contributor

It might be easier to just fetch it from the model, but it would be great if we could avoid loading or downloading the model.
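
The lookup discussed in this thread can be sketched without loading the model: read the small `config_sentence_transformers.json` file and fall back to cosine when the key is absent. This is a minimal sketch, not the code merged in this PR. The helper names `similarity_from_config` and `fetch_similarity_fn_name` are hypothetical, and the fetch step assumes `huggingface_hub` is installed.

```python
import json
from typing import Any, Mapping, Optional

DEFAULT_SIMILARITY = "cosine"  # assumed fallback, as floated in the review


def similarity_from_config(config: Mapping[str, Any]) -> str:
    """Return the similarity function named in a parsed config, else the fallback."""
    value = config.get("similarity_fn_name")
    return value if isinstance(value, str) and value else DEFAULT_SIMILARITY


def fetch_similarity_fn_name(repo_id: str, revision: Optional[str] = None) -> str:
    """Download only the JSON config (not the weights) and read the key."""
    # Imported lazily so the pure helper above works without huggingface_hub.
    from huggingface_hub import hf_hub_download
    from huggingface_hub.utils import EntryNotFoundError

    try:
        path = hf_hub_download(
            repo_id, "config_sentence_transformers.json", revision=revision
        )
    except EntryNotFoundError:
        # The repository has no Sentence Transformers config file.
        return DEFAULT_SIMILARITY
    with open(path, encoding="utf-8") as f:
        return similarity_from_config(json.load(f))
```

Keeping the parsing separate from the download makes the fallback rule testable offline, for example `similarity_from_config({})` returns the assumed default.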

Member Author

Added from_hub_for_sentence_transformer; maybe not the best naming.

Contributor

Hmm I don't think that is what we want. Why can't that just be the default for from_hub?

Member Author

Renamed

"""
from mteb.models import CrossEncoderWrapper

meta = cls.from_hf_hub(model.model.name_or_path, revision, compute_metadata)
Contributor

Suggested change
meta = cls.from_hf_hub(model.model.name_or_path, revision, compute_metadata)
meta = cls.from_hf_hub(model.model.name_or_path, revision, compute_metadata)

maybe worth splitting this into an inner _fetch_metadata_from_hub
and a public fetch_from_hub.
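
One way to read this suggestion: keep the raw Hub lookup in a private helper that returns plain fields, and let the public classmethod only assemble the object. The following is a minimal sketch under stated assumptions. `SketchMeta`, its fields, the `sha` and `license` keys, and the injectable `fetcher` are hypothetical stand-ins, not the real `ModelMeta` API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

# Hypothetical callable that returns raw Hub info for (name, revision).
Fetcher = Callable[[str, Optional[str]], Dict[str, Any]]


@dataclass
class SketchMeta:
    name: str
    revision: Optional[str] = None
    license: Optional[str] = None

    @staticmethod
    def _fetch_metadata_from_hub(
        name: str, revision: Optional[str], fetcher: Fetcher
    ) -> Dict[str, Any]:
        """Inner step: talk to the Hub and return plain constructor fields."""
        raw = fetcher(name, revision)
        return {
            "name": name,
            "revision": raw.get("sha", revision),
            "license": raw.get("license"),
        }

    @classmethod
    def fetch_from_hub(
        cls, name: str, revision: Optional[str] = None, *, fetcher: Fetcher
    ) -> "SketchMeta":
        """Public entry point: build the object from the fetched fields."""
        return cls(**cls._fetch_metadata_from_hub(name, revision, fetcher))
```

With this split, other constructors (for example one that starts from an already loaded model) could reuse the inner helper and override individual fields afterwards.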

Comment on lines +506 to +507
if "API" in self.framework or self.name is None:
    return None
Contributor

  1. Let us add a warning here
  2. Shouldn't this be based on the number of parameters? It could have API tag while also having public weights

Member Author

Currently we only use the API tag for private models, so I'm not sure what to do here.

KennethEnevoldsen (Contributor), Dec 14, 2025

But if it is private, isn't the number of parameters None? (We have a few private models where the number of parameters is public, but then I suppose we could also estimate the memory usage?) Though I see that might be a bit odd.
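
The parameter-based alternative raised in this thread could look roughly like the sketch below: estimate memory from the parameter count and warn, rather than silently returning None, when the count is unknown. This is an illustration only, not the merged implementation. The function name `estimate_memory_usage_mb` and the `bytes_per_parameter` argument are hypothetical.

```python
import warnings
from typing import Optional


def estimate_memory_usage_mb(
    n_parameters: Optional[int], bytes_per_parameter: int = 4
) -> Optional[int]:
    """Estimate weight memory in MiB from a parameter count.

    bytes_per_parameter is an assumption about the stored dtype
    (4 for 32-bit floats, 2 for 16-bit floats).
    """
    if n_parameters is None:
        # Warn so callers can tell "unknown" apart from a silent skip.
        warnings.warn("n_parameters is unknown; cannot estimate memory usage.")
        return None
    return (n_parameters * bytes_per_parameter) // (1024**2)
```

As a sanity check, `estimate_memory_usage_mb(595776512, bytes_per_parameter=2)` gives 1136, which matches the `memory_usage_mb` value in the example output in the PR description and is consistent with 16-bit weights.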

KennethEnevoldsen changed the title from "refactor ModelMeta" to "fix: Add .from_hf_hub() and similar methods to ModelMeta" on Dec 14, 2025
KennethEnevoldsen changed the title from "fix: Add .from_hf_hub() and similar methods to ModelMeta" to "feat: Add .from_hf_hub() and similar methods to ModelMeta" on Dec 14, 2025
Samoed (Member Author) commented Dec 14, 2025

Added to the doc
image

Samoed (Member Author) commented Dec 14, 2025

FYI: this would also close #3695 and #3734 (added to the Development section).

Samoed added the "enhancement" (New feature or request) label on Dec 14, 2025
Samoed (Member Author) commented Dec 14, 2025

Either we are hitting HF rate limits again or HF is having problems, but there is nothing on their status page (AWS seems fine too).

KennethEnevoldsen (Contributor) left a comment

I think we are at a solid state!

Samoed enabled auto-merge (squash) December 15, 2025 21:00
Samoed disabled auto-merge December 15, 2025 21:03
Samoed enabled auto-merge (squash) December 15, 2025 21:06
Samoed merged commit 28e733e into main Dec 15, 2025
10 checks passed
Samoed deleted the refactor_meta branch December 15, 2025 21:20

Development

Successfully merging this pull request may close these issues.

refactor _model_meta_from_hf() and similar to be methods on ModelMeta
Add memory_usage_mb to get_model_meta
Add reference extraction to get meta

3 participants