
Use git-aware cache file layout #2339

Closed
liuxueyang wants to merge 3 commits into huggingface:master from liuxueyang:new-cache-layout

Conversation

@liuxueyang

`cached_download` is deprecated. Use `hf_hub_download` instead to take advantage of the new cache.

The new cache is introduced in https://github.com/huggingface/huggingface_hub/releases/tag/v0.8.1

@tomaarsen
Member

Though I'm in favor of moving towards `hf_hub_download`, this PR does not correctly load models.
E.g.

from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

results in the warning `No sentence-transformers model found with name sentence-transformers/all-mpnet-base-v2.`

  • Tom Aarsen

@liuxueyang
Author

liuxueyang commented Nov 6, 2023

Though I'm in favor of moving towards hf_hub_download, this PR does not correctly load models. E.g.

from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

@tomaarsen

I've tested with the model above and it works: (I added some log outputs to make sure it uses my local sentence-transformers repo.)

➜  sentence-transformers git:(new-cache-layout) ✗ python
Python 3.11.0 (main, Oct 13 2023, 15:26:21) [GCC 8.5.0 20210514 (Red Hat 8.5.0-18)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
SentenceTransformer package from workspace
SentenceTransformer constructor model_name_or_path=sentence-transformers/all-mpnet-base-v2, cache_foler=/home/gpadmin/.cache/torch/sentence_transformers
snapshot_download from hub with caching
....

the directory hierarchy is:

➜  sentence-transformers git:(new-cache-layout) ✗ tree ~/.cache/torch
/home/gpadmin/.cache/torch
└── sentence_transformers
   └── sentence-transformers_all-mpnet-base-v2
      ├── 1_Pooling
      ├── models--sentence-transformers--all-mpnet-base-v2
      │  ├── blobs
      │  │  ├── 1c51ab79a2298a340952d3e6012042a9c84bbe4d
      │  │  ├── 2e85f0eac205cf444bdf97ede4935603ca6a0416
      │  │  ├── 4a9bd7067c525d050f6b771c44b0a5e9ba644731
      │  │  ├── 4ac8dde434b7e7919dff242b49154d1af9d1b620
      │  │  ├── 4e09f293dfe90bba49f87cfe7996271f07be2666
      │  │  ├── 4eb670bb4e7f34e9031acec2b86d39e5c921198e
      │  │  ├── 6d34772f5ca361021038b404fb913ec8dc0b1a5a
      │  │  ├── 7c8e194053bc80e27e19bb2125469e4f289ab2b3
      │  │  ├── 20ae1276042f43d1c80f4f7b7f084a8704592c1d
      │  │  ├── 378d4fa393d5eaccf69c437a20f1cda6ac65c14d
      │  │  ├── 952a9b81c0bfd99800fabf352f69c7ccd46c5e43
      │  │  ├── a8fd120b1a0032e70ff3d4b8ab8e46a6d01c2cb08ffe7c007a021c1788928146
      │  │  ├── b9fd4298819da011007a6a4ceb728c860914fc88
      │  │  └── fd1b291129c607e5d49799f87cb219b27f98acdf
      │  └── snapshots
      │     └── 5681fe04da6e48e851d5dd1af673670cdb299753
      │        ├── 1_Pooling
      │        │  └── config.json -> ../../../blobs/4e09f293dfe90bba49f87cfe7996271f07be2666
      │        ├── config.json -> ../../blobs/b9fd4298819da011007a6a4ceb728c860914fc88
      │        ├── config_sentence_transformers.json -> ../../blobs/fd1b291129c607e5d49799f87cb219b27f98acdf
      │        ├── data_config.json -> ../../blobs/2e85f0eac205cf444bdf97ede4935603ca6a0416
      │        ├── modules.json -> ../../blobs/952a9b81c0bfd99800fabf352f69c7ccd46c5e43
      │        ├── pytorch_model.bin -> ../../blobs/a8fd120b1a0032e70ff3d4b8ab8e46a6d01c2cb08ffe7c007a021c1788928146
      │        ├── README.md -> ../../blobs/4a9bd7067c525d050f6b771c44b0a5e9ba644731
      │        ├── sentence_bert_config.json -> ../../blobs/4eb670bb4e7f34e9031acec2b86d39e5c921198e
      │        ├── special_tokens_map.json -> ../../blobs/378d4fa393d5eaccf69c437a20f1cda6ac65c14d
      │        ├── tokenizer.json -> ../../blobs/7c8e194053bc80e27e19bb2125469e4f289ab2b3
      │        ├── tokenizer_config.json -> ../../blobs/20ae1276042f43d1c80f4f7b7f084a8704592c1d
      │        ├── train_script.py -> ../../blobs/4ac8dde434b7e7919dff242b49154d1af9d1b620
      │        └── vocab.txt -> ../../blobs/1c51ab79a2298a340952d3e6012042a9c84bbe4d
      └── tmpenpklpwt
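The `models--org--name` / `snapshots/<commit>` hierarchy above follows the git-aware layout convention, with snapshot files symlinked into a shared `blobs/` store. As a rough illustration (the `cache_dir_for` helper below is hypothetical, not part of huggingface_hub):

```python
import os

def cache_dir_for(repo_id: str, revision: str, cache_root: str) -> str:
    # "org/name" becomes "models--org--name"; each snapshot is keyed by a
    # commit hash, and its files are symlinks into the shared blobs/ store.
    folder = "models--" + repo_id.replace("/", "--")
    return os.path.join(cache_root, folder, "snapshots", revision)

print(cache_dir_for(
    "sentence-transformers/all-mpnet-base-v2",
    "5681fe04da6e48e851d5dd1af673670cdb299753",
    os.path.expanduser("~/.cache/torch/sentence_transformers/sentence-transformers_all-mpnet-base-v2"),
))
```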

@tomaarsen
Member

tomaarsen commented Nov 6, 2023

The PR does indeed download the model correctly, but it then tries to check whether the loaded model is a Sentence Transformer model using `if os.path.exists(os.path.join(model_name_or_path, 'modules.json')):`, with `model_name_or_path` as `"sentence-transformers/all-mpnet-base-v2"` in our example. This check will be `False`, as nothing is downloaded to this path.
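The failing probe can be reproduced with plain `os.path` (a minimal sketch of the check described above, not the exact sentence-transformers source):

```python
import os

# The repo id is treated as a relative filesystem path, so the probe
# fails even though the files were downloaded into the HF cache.
model_name_or_path = "sentence-transformers/all-mpnet-base-v2"
is_sbert_model = os.path.exists(os.path.join(model_name_or_path, "modules.json"))
print(is_sbert_model)  # False unless such a local directory happens to exist
```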

See for example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
print(model.encode(["This is a test sentence"]).sum())

On master:

Results in

0.10015583

On this PR

Results in

No sentence-transformers model found with name sentence-transformers/all-mpnet-base-v2. Creating a new one with MEAN pooling.
0.29182202

Let me know if the embeddings do match for you.

  • Tom Aarsen

@liuxueyang
Author

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
print(model.encode(["This is a test sentence"]).sum())

The reason is that the master branch uses the function `_load_sbert_model` to load the model, which reads the file `sentence_bert_config.json` to get the value of the field `max_seq_length`: 384.

modules = OrderedDict([('0', Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel ), ('1', Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})), ('2', Normalize())])

My branch uses the function `_load_auto_model` to load the model, which uses the file `tokenizer_config.json` to set the field `max_seq_length`: 512.

modules = OrderedDict([('0', Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel ), ('1', Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False}))])
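The two values come from different config files. A minimal sketch, assuming the field names involved (`max_seq_length` in `sentence_bert_config.json`, `model_max_length` in `tokenizer_config.json`; the inline JSON is illustrative, not copied from the repo):

```python
import json

# What _load_sbert_model reads on master:
sentence_bert_config = json.loads('{"max_seq_length": 384, "do_lower_case": false}')
# What _load_auto_model falls back to on this branch:
tokenizer_config = json.loads('{"model_max_length": 512}')

print(sentence_bert_config["max_seq_length"])  # 384
print(tokenizer_config["model_max_length"])    # 512
```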

I pushed a commit to fix this bug. In the above example, the outputs are the same on both branches now:

>>> print(model.encode(["This is a test sentence"]).sum())
0.10015613

I don't know why the output is different from yours.

@tomaarsen
Member

`_load_auto_model` should only be used to load pure transformers models, not SentenceTransformer models, because it will ignore all `modules.json` information. E.g. what if you're trying to load a model with a Dense layer with your PR, like https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased?
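For context, modules.json is an ordered list of module specs, and a loader that skips it keeps only what it can infer from the transformers checkpoint. A hedged sketch (the JSON below is illustrative, not copied from the distiluse repo):

```python
import json

modules_json = json.loads("""
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Dense", "type": "sentence_transformers.models.Dense"}
]
""")

# A loader that never reads this list would silently drop module 2
# (the Dense projection), changing the embeddings it produces.
for module in modules_json:
    print(module["idx"], module["type"])
```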

I think #2345 should supersede this PR, it also moves to hf_hub_download, though it doesn't naively download the full repository anymore.

  • Tom Aarsen

@liuxueyang
Author

Thank you! This PR can be closed.

@liuxueyang liuxueyang closed this Nov 7, 2023