
Use git-aware cache file layout #2339

Closed
liuxueyang wants to merge 3 commits into huggingface:master from liuxueyang:new-cache-layout

Conversation

@liuxueyang

`cached_download` is deprecated. Use `hf_hub_download` instead to take advantage of the new cache.

The new cache is introduced in https://github.com/huggingface/huggingface_hub/releases/tag/v0.8.1

@tomaarsen
Member

Though I'm in favor of moving towards `hf_hub_download`, this PR does not correctly load models.
E.g.

from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

results in the warning `No sentence-transformers model found with name sentence-transformers/all-mpnet-base-v2.`

  • Tom Aarsen

@liuxueyang
Author

liuxueyang commented Nov 6, 2023

Though I'm in favor of moving towards hf_hub_download, this PR does not correctly load models. E.g.

from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

@tomaarsen

I've tested with the model above and it works: (I added some log outputs to make sure it uses my local sentence-transformers repo.)

➜  sentence-transformers git:(new-cache-layout) ✗ python
Python 3.11.0 (main, Oct 13 2023, 15:26:21) [GCC 8.5.0 20210514 (Red Hat 8.5.0-18)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
SentenceTransformer package from workspace
SentenceTransformer constructor model_name_or_path=sentence-transformers/all-mpnet-base-v2, cache_foler=/home/gpadmin/.cache/torch/sentence_transformers
snapshot_download from hub with caching
....

the directory hierarchy is:

➜  sentence-transformers git:(new-cache-layout) ✗ tree ~/.cache/torch
/home/gpadmin/.cache/torch
└── sentence_transformers
   └── sentence-transformers_all-mpnet-base-v2
      ├── 1_Pooling
      ├── models--sentence-transformers--all-mpnet-base-v2
      │  ├── blobs
      │  │  ├── 1c51ab79a2298a340952d3e6012042a9c84bbe4d
      │  │  ├── 2e85f0eac205cf444bdf97ede4935603ca6a0416
      │  │  ├── 4a9bd7067c525d050f6b771c44b0a5e9ba644731
      │  │  ├── 4ac8dde434b7e7919dff242b49154d1af9d1b620
      │  │  ├── 4e09f293dfe90bba49f87cfe7996271f07be2666
      │  │  ├── 4eb670bb4e7f34e9031acec2b86d39e5c921198e
      │  │  ├── 6d34772f5ca361021038b404fb913ec8dc0b1a5a
      │  │  ├── 7c8e194053bc80e27e19bb2125469e4f289ab2b3
      │  │  ├── 20ae1276042f43d1c80f4f7b7f084a8704592c1d
      │  │  ├── 378d4fa393d5eaccf69c437a20f1cda6ac65c14d
      │  │  ├── 952a9b81c0bfd99800fabf352f69c7ccd46c5e43
      │  │  ├── a8fd120b1a0032e70ff3d4b8ab8e46a6d01c2cb08ffe7c007a021c1788928146
      │  │  ├── b9fd4298819da011007a6a4ceb728c860914fc88
      │  │  └── fd1b291129c607e5d49799f87cb219b27f98acdf
      │  └── snapshots
      │     └── 5681fe04da6e48e851d5dd1af673670cdb299753
      │        ├── 1_Pooling
      │        │  └── config.json -> ../../../blobs/4e09f293dfe90bba49f87cfe7996271f07be2666
      │        ├── config.json -> ../../blobs/b9fd4298819da011007a6a4ceb728c860914fc88
      │        ├── config_sentence_transformers.json -> ../../blobs/fd1b291129c607e5d49799f87cb219b27f98acdf
      │        ├── data_config.json -> ../../blobs/2e85f0eac205cf444bdf97ede4935603ca6a0416
      │        ├── modules.json -> ../../blobs/952a9b81c0bfd99800fabf352f69c7ccd46c5e43
      │        ├── pytorch_model.bin -> ../../blobs/a8fd120b1a0032e70ff3d4b8ab8e46a6d01c2cb08ffe7c007a021c1788928146
      │        ├── README.md -> ../../blobs/4a9bd7067c525d050f6b771c44b0a5e9ba644731
      │        ├── sentence_bert_config.json -> ../../blobs/4eb670bb4e7f34e9031acec2b86d39e5c921198e
      │        ├── special_tokens_map.json -> ../../blobs/378d4fa393d5eaccf69c437a20f1cda6ac65c14d
      │        ├── tokenizer.json -> ../../blobs/7c8e194053bc80e27e19bb2125469e4f289ab2b3
      │        ├── tokenizer_config.json -> ../../blobs/20ae1276042f43d1c80f4f7b7f084a8704592c1d
      │        ├── train_script.py -> ../../blobs/4ac8dde434b7e7919dff242b49154d1af9d1b620
      │        └── vocab.txt -> ../../blobs/1c51ab79a2298a340952d3e6012042a9c84bbe4d
      └── tmpenpklpwt
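The `models--org--name` / `snapshots/<commit>` hierarchy above follows the git-aware layout convention, with snapshot files symlinked into a shared `blobs/` store. As a rough illustration (the `cache_dir_for` helper below is hypothetical, not part of huggingface_hub):

```python
import os

def cache_dir_for(repo_id: str, revision: str, cache_root: str) -> str:
    # "org/name" becomes "models--org--name"; each snapshot is keyed by a
    # commit hash, and its files are symlinks into the shared blobs/ store.
    folder = "models--" + repo_id.replace("/", "--")
    return os.path.join(cache_root, folder, "snapshots", revision)

print(cache_dir_for(
    "sentence-transformers/all-mpnet-base-v2",
    "5681fe04da6e48e851d5dd1af673670cdb299753",
    os.path.expanduser("~/.cache/torch/sentence_transformers/sentence-transformers_all-mpnet-base-v2"),
))
```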

@tomaarsen
Member

tomaarsen commented Nov 6, 2023

The PR does indeed download the model correctly, but it then tries to check whether the loaded model is a Sentence Transformer model using `if os.path.exists(os.path.join(model_name_or_path, 'modules.json')):`, with `model_name_or_path` as `"sentence-transformers/all-mpnet-base-v2"` in our example. This check will be `False`, as nothing is downloaded to this path.
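The failing probe can be reproduced with plain `os.path` (a minimal sketch of the check described above, not the exact sentence-transformers source):

```python
import os

# The repo id is treated as a relative filesystem path, so the probe
# fails even though the files were downloaded into the HF cache.
model_name_or_path = "sentence-transformers/all-mpnet-base-v2"
is_sbert_model = os.path.exists(os.path.join(model_name_or_path, "modules.json"))
print(is_sbert_model)  # False unless such a local directory happens to exist
```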

See for example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
print(model.encode(["This is a test sentence"]).sum())

On master:

Results in

0.10015583

On this PR

Results in

No sentence-transformers model found with name sentence-transformers/all-mpnet-base-v2. Creating a new one with MEAN pooling.
0.29182202

Let me know if the embeddings do match for you.

  • Tom Aarsen

@liuxueyang
Author

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
print(model.encode(["This is a test sentence"]).sum())

The reason is that the master branch uses the function `_load_sbert_model` to load the model, which reads the file `sentence_bert_config.json` to get the value of the field `max_seq_length`: 384.

modules = OrderedDict([('0', Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel ), ('1', Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})), ('2', Normalize())])

My branch uses the function `_load_auto_model` to load the model, which uses the file `tokenizer_config.json` to set the field `max_seq_length`: 512.

modules = OrderedDict([('0', Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel ), ('1', Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False}))])
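The two values come from different config files. A minimal sketch, assuming the field names involved (`max_seq_length` in `sentence_bert_config.json`, `model_max_length` in `tokenizer_config.json`; the inline JSON is illustrative, not copied from the repo):

```python
import json

# What _load_sbert_model reads on master:
sentence_bert_config = json.loads('{"max_seq_length": 384, "do_lower_case": false}')
# What _load_auto_model falls back to on this branch:
tokenizer_config = json.loads('{"model_max_length": 512}')

print(sentence_bert_config["max_seq_length"])  # 384
print(tokenizer_config["model_max_length"])    # 512
```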

I pushed a commit to fix this bug. In the above example, the outputs are the same on both branches now:

>>> print(model.encode(["This is a test sentence"]).sum())
0.10015613

I don't know why the output is different from yours.

@tomaarsen
Member

`_load_auto_model` should only be used to load pure transformers models, not SentenceTransformer models, because it will ignore all `modules.json` information. E.g. what if you're trying to load a model with a Dense layer with your PR, like https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased?
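For context, modules.json is an ordered list of module specs, and a loader that skips it keeps only what it can infer from the transformers checkpoint. A hedged sketch (the JSON below is illustrative, not copied from the distiluse repo):

```python
import json

modules_json = json.loads("""
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Dense", "type": "sentence_transformers.models.Dense"}
]
""")

# A loader that never reads this list would silently drop module 2
# (the Dense projection), changing the embeddings it produces.
for module in modules_json:
    print(module["idx"], module["type"])
```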

I think #2345 should supersede this PR, it also moves to hf_hub_download, though it doesn't naively download the full repository anymore.

  • Tom Aarsen

@liuxueyang
Author

Thank you! This PR can be closed.

@liuxueyang liuxueyang closed this Nov 7, 2023