@@ -80,12 +80,12 @@ Below are some examples of the currently supported models:
 | 37 | 0.3B | Alibaba GTE | [Snowflake/snowflake-arctic-embed-m-v2.0](https://hf.co/Snowflake/snowflake-arctic-embed-m-v2.0) |
 | 49 | 0.5B | XLM-RoBERTa | [intfloat/multilingual-e5-large-instruct](https://hf.co/intfloat/multilingual-e5-large-instruct) |
 | N/A | 0.4B | Alibaba GTE | [Alibaba-NLP/gte-large-en-v1.5](https://hf.co/Alibaba-NLP/gte-large-en-v1.5) |
+| N/A | 0.4B | ModernBERT | [answerdotai/ModernBERT-large](https://hf.co/answerdotai/ModernBERT-large) |
 | N/A | 0.1B | NomicBert | [nomic-ai/nomic-embed-text-v1](https://hf.co/nomic-ai/nomic-embed-text-v1) |
 | N/A | 0.1B | NomicBert | [nomic-ai/nomic-embed-text-v1.5](https://hf.co/nomic-ai/nomic-embed-text-v1.5) |
 | N/A | 0.1B | JinaBERT | [jinaai/jina-embeddings-v2-base-en](https://hf.co/jinaai/jina-embeddings-v2-base-en) |
 | N/A | 0.1B | JinaBERT | [jinaai/jina-embeddings-v2-base-code](https://hf.co/jinaai/jina-embeddings-v2-base-code) |
 | N/A | 0.1B | MPNet | [sentence-transformers/all-mpnet-base-v2](https://hf.co/sentence-transformers/all-mpnet-base-v2) |
-| N/A | 0.4B | ModernBERT | [answerdotai/ModernBERT-large](https://hf.co/answerdotai/ModernBERT-large) |

 To explore the list of best-performing text embedding models, visit the
 [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
@@ -109,7 +109,7 @@ Below are some examples of the currently supported models:
 model=BAAI/bge-large-en-v1.5
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.6 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
 ```

 And then you can make requests like
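For instance, you can call the `/embed` route with `curl` (a representative request against the port mapping above; the input string is illustrative):

```shell
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```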
@@ -319,13 +319,13 @@ Text Embeddings Inference ships with multiple Docker images that you can use to

 | Architecture                        | Image                                                                    |
 |-------------------------------------|--------------------------------------------------------------------------|
-| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.6                    |
+| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7                    |
 | Volta                               | NOT SUPPORTED                                                            |
-| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.6 (experimental)  |
-| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.6                        |
-| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.6                     |
-| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.6                     |
-| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.6 (experimental)  |
+| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.7 (experimental)  |
+| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.7                        |
+| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.7                     |
+| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.7                     |
+| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.7 (experimental)  |

 **Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
 You can turn Flash Attention v1 on with the `USE_FLASH_ATTENTION=True` environment variable.
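For example, to opt in on a Turing card, pass the variable when starting the container (a sketch reusing the `$model` and `$volume` variables from the quick-start snippet above):

```shell
docker run --gpus all -e USE_FLASH_ATTENTION=True -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:turing-1.7 --model-id $model
```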
@@ -354,7 +354,7 @@ model=<your private model>
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 token=<your CLI READ token>

-docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.6 --model-id $model
+docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
 ```

 ### Air-gapped deployment
@@ -377,7 +377,7 @@ git clone https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5
 volume=$PWD

 # Mount the models directory inside the container with a volume and set the model ID
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.6 --model-id /data/gte-base-en-v1.5
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id /data/gte-base-en-v1.5
 ```

 ### Using re-ranker models
@@ -394,7 +394,7 @@ downstream performance.
 model=BAAI/bge-reranker-large
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.6 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
 ```

 And then you can rank the similarity between a query and a list of texts with:
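For example, with `curl` against the `/rerank` route (a representative request; the query and candidate texts are illustrative):

```shell
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
```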
@@ -414,7 +414,7 @@ You can also use classic Sequence Classification models like `SamLowe/roberta-ba
 model=SamLowe/roberta-base-go_emotions
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.6 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
 ```

 Once you have deployed the model, you can use the `predict` endpoint to get the emotions most associated with an input:
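A representative request (the input sentence is illustrative; the response is a list of labels with scores):

```shell
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs": "I like you."}' \
    -H 'Content-Type: application/json'
```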
@@ -434,7 +434,7 @@ You can choose to activate SPLADE pooling for Bert and Distilbert MaskedLM archi
 model=naver/efficient-splade-VI-BT-large-query
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.6 --model-id $model --pooling splade
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model --pooling splade
 ```

 Once you have deployed the model, you can use the `/embed_sparse` endpoint to get the sparse embedding:
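A representative request (illustrative input; the response contains only the non-zero dimensions of the SPLADE vector):

```shell
curl 127.0.0.1:8080/embed_sparse \
    -X POST \
    -d '{"inputs": "I like you."}' \
    -H 'Content-Type: application/json'
```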
@@ -463,7 +463,7 @@ You can use the gRPC API by adding the `-grpc` tag to any TEI Docker image. For
 model=BAAI/bge-large-en-v1.5
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.6-grpc --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7-grpc --model-id $model
 ```

 ```shell
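# A representative call, assuming grpcurl is installed and TEI's default
# gRPC service definition (tei.v1); adjust host/port to your deployment.
grpcurl -d '{"inputs": "What is Deep Learning"}' -plaintext 0.0.0.0:8080 tei.v1.Embed/Embed
```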