
Model Request for BAAI/bge-m3 (XLMRoberta-based Multilingual Embedding Model) #6007

Closed
4 tasks done
mofanke opened this issue Mar 12, 2024 · 11 comments
Labels
enhancement New feature or request stale

Comments


mofanke commented Mar 12, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Supporting a multilingual embedding.
https://huggingface.co/BAAI/bge-m3

Motivation

There are some differences between this multilingual embedding model and BERT.

Possible Implementation

Sorry, no concrete idea. I tried it; the model architecture seems to be the same as BERT's, but the tokenizer is XLMRobertaTokenizer, not BertTokenizer.
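To make the tokenizer difference concrete: BERT's WordPiece and XLM-R's SentencePiece use incompatible subword marker conventions, which is one reason the BERT tokenizer cannot simply be swapped in. A minimal illustration of the two conventions (the token strings below are made-up examples, not real vocabulary entries):

```python
# WordPiece (BERT) marks word *continuations* with a "##" prefix;
# SentencePiece (XLM-R) marks word *starts* with a "▁" (U+2581) prefix.
wordpiece_tokens = ["emb", "##ed", "##ding"]
sentencepiece_tokens = ["▁emb", "ed", "ding"]

def detokenize_wordpiece(tokens):
    # Continuation pieces are glued on; other pieces start a new word.
    return "".join(t[2:] if t.startswith("##") else " " + t for t in tokens).strip()

def detokenize_sentencepiece(tokens):
    # Pieces are concatenated as-is; "▁" markers become word boundaries.
    return "".join(tokens).replace("▁", " ").strip()

print(detokenize_wordpiece(wordpiece_tokens))        # embedding
print(detokenize_sentencepiece(sentencepiece_tokens))  # embedding
print(detokenize_wordpiece(["hello", "world"]))       # hello world
print(detokenize_sentencepiece(["▁hello", "▁world"]))  # hello world
```

Because the vocabularies and marker schemes differ like this, feeding XLM-R token IDs through BERT tokenizer assumptions produces garbage or out-of-vocabulary lookups.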

mofanke added the enhancement label Mar 12, 2024
github-actions bot added the stale label Apr 12, 2024
@RoggeOhta

I would also like to request support for this model.

github-actions bot removed the stale label Apr 24, 2024

vonjackustc commented May 4, 2024

I tried to support it using BertModel and an SPM (SentencePiece) tokenizer:
https://huggingface.co/vonjack/bge-m3-gguf

Tested cosine similarity between "中国" and "中华人民共和国":
bge-m3-f16: 0.9993230772798457
mxbai-embed-large-v1-f16: 0.7287733321223814
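For anyone reproducing this comparison, cosine similarity between two embedding vectors can be computed with a short helper (a sketch; the vectors would come from whatever embedding backend you use, and bge-m3's dense vectors are 1024-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy example; real embedding vectors are much longer:
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```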


vuminhquang commented May 12, 2024

I got an error when using it with LangChain:
"terminate called after throwing an instance of 'std::out_of_range'"


ciekawy commented May 21, 2024

Same here with llama.cpp; the full error:

libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found


ciekawy commented May 21, 2024

The _bert version does not crash, but the embeddings do not seem to make any sense...


ciekawy commented May 21, 2024

I also tried to follow the instructions at https://github.com/PrithivirajDamodaran/blitz-embed, but after converting to GGUF I get this error:

llama_model_quantize: failed to quantize: key not found in model: bert.context_length


ciekawy commented May 22, 2024

@vonjackustc, could you share the parameters you used with llama.cpp?


github-actions bot commented Jul 7, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Jul 7, 2024

theta-lin commented Jul 13, 2024

@vonjackustc Same issue as @vuminhquang and @ciekawy when running it with Ollama.

It appears that embedding text containing \n (a newline character) results in the following error:

terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at

This issue is also brought up here: https://huggingface.co/vonjack/bge-m3-gguf/discussions/3.

BTW, as an alternative, I am using Text Embeddings Inference to run BAAI/bge-m3 now.
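For reference, querying a running Text Embeddings Inference server is a small HTTP call against its /embed endpoint. A minimal stdlib-only client sketch (the base URL and port are assumptions; adjust to wherever your TEI instance, started with --model-id BAAI/bge-m3, is listening):

```python
import json
import urllib.request

TEI_URL = "http://localhost:8080"  # assumption: your TEI server's address

def build_payload(texts):
    """Serialize the JSON request body for TEI's /embed endpoint."""
    return json.dumps({"inputs": texts}).encode("utf-8")

def embed(texts, base_url=TEI_URL):
    """POST a list of texts to a running TEI instance; returns a list of vectors."""
    req = urllib.request.Request(
        f"{base_url}/embed",
        data=build_payload(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (requires a running server):
# vectors = embed(["中国", "中华人民共和国"])
```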


ciekawy commented Jul 13, 2024

For embeddings, I'd say it is safe, if not desirable, to remove newlines most of the time. This may be less obvious for longer texts, but still...
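A minimal pre-processing helper along those lines, as a workaround for the newline crash (a sketch; whether collapsing all whitespace is acceptable depends on your retrieval use case):

```python
def strip_newlines(text: str) -> str:
    """Collapse all whitespace runs (including "\n") into single spaces."""
    return " ".join(text.split())

print(strip_newlines("first line\nsecond line"))  # first line second line
```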


Huoxu69 commented Jul 26, 2024

> Tried to support it, use BertModel & SPM tokenizer. https://huggingface.co/vonjack/bge-m3-gguf
>
> Tested cosine similarity between "中国" and "中华人民共和国": bge-m3-f16: 0.9993230772798457 mxbai-embed-large-v1-f16: 0.7287733321223814

May I ask how exactly this is accomplished?


7 participants