Add support for BERT embedding models #5423

Merged: 21 commits, merged into ggerganov:master on Feb 11, 2024
Conversation

@iamlemec (Collaborator) commented Feb 8, 2024

Following discussion in #2872, this adds support for the BERT model architecture, built on top of various contributions from @skeskinen, @xyzhang626, and @cebtenzzre. It includes:

  • A new WordPiece tokenizer, llm_tokenize_wpm, needed for behavior slightly different from SentencePiece (see the sketch after this list). On conversion, the vocab is mapped from the ## subword scheme to a prefix scheme to allow for unified vocab mappings.
  • A new model field, bert.attention.causal, that controls whether the attention mask is causal (default is true). Also tokenizer.ggml.token_type_count, which accounts for token type info, though token types are typically ignored in actual computations.
  • Addition of build_bert for graph construction. This is fairly standard; the only difference is the pooling layer at the end. Currently it pools the entire batch. Ideally, it would pool only within each sequence.
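For readers unfamiliar with WordPiece, here is a minimal sketch of its greedy longest-match behavior, written in the classic ## continuation form. The function name and vocab layout are illustrative only; the actual llm_tokenize_wpm works on the remapped prefix-scheme vocab described above, so details differ:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Greedy longest-match WordPiece for a single (whitespace-split) word.
// If any part of the word cannot be matched, the whole word becomes [UNK].
std::vector<int> wpm_tokenize_word(
        const std::string & word,
        const std::unordered_map<std::string, int> & vocab,
        int unk_id) {
    std::vector<int> ids;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        int    id  = -1;
        while (end > start) {
            std::string piece = word.substr(start, end - start);
            if (start > 0) {
                piece = "##" + piece; // continuation pieces carry the ## marker
            }
            auto it = vocab.find(piece);
            if (it != vocab.end()) {
                id = it->second;
                break;
            }
            end--; // shrink the candidate and try again
        }
        if (id == -1) {
            return { unk_id };
        }
        ids.push_back(id);
        start = end;
    }
    return ids;
}
```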

In terms of which models actually work, the main limitation is tokenization. I have tested with all-MiniLM-L6-v2 and BAAI/bge-*-*-v1.5 (small, base, and large, plus en and zh), and they seem to work: the embedding numbers look similar to the Huggingface implementations. The newer BAAI/bge-m3 uses a SentencePiece tokenizer, so it should be doable, but I haven't tested it.

Review thread on convert-hf-to-gguf.py (outdated, since resolved):
self.block_count = self.hparams["num_hidden_layers"]

def set_gguf_parameters(self):
# TODO(cebtenzzre): merge with parent class
A collaborator commented:

Note to self: resolve this before merge

@iacore (Contributor) commented Jun 29, 2024

have you... have you forgotten about this...

@ggerganov (Owner) commented:

> In terms of which models actually work, the main limitation is tokenization.

When I was playing with bert.cpp the other day, I noticed some potential problems with the tokenization when using a bge-base model. For example:

./build/bin/main -m models/bge-base-en-v1.5/ggml-model-f16.gguf -p "This is a ggml"

Tokenizes to:

101 -> [CLS]
2023 -> this
2003 -> is
1037 -> a
2290 -> ##g
19968 -> ##ml
102 -> [SEP]

It seems like a "g" is missing, and the 2290 token has a spurious "##" continuation prefix. So the tokenization might need some more work, but this can be improved later.

@iamlemec (Collaborator, Author) commented Feb 9, 2024

> When I was playing with bert.cpp the other day, I noticed some potential problems with the tokenization when using a bge-base model.

Ah yeah, that was a bug in bert.cpp that was fixed a few days ago. It's correct in this PR.

@iamlemec (Collaborator, Author) commented:

I have batched embedding working now (bert-batched). Basically, it just matmuls an [n_tokens, n_tokens] pooling matrix at the end. It would make more sense for it to be [n_tokens, n_seq_max], but we don't actually know n_seq_max, so this is the worst case. Embeddings can be fetched by seq_id just like logits, using get_embeddings_ith. I also updated the embeddings example to split input by lines and embed each line as a separate sequence in one batch. (A sketch of the pooling idea follows.)
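A minimal illustration of what the pooling step computes, in plain C++ rather than the actual ggml graph code (the names here are hypothetical): with a pooling matrix P of shape [n_seq, n_tokens] whose row s holds 1/len(s) in the columns of sequence s's tokens, pooled = P * E is exactly this per-sequence mean:

```cpp
#include <vector>

// embd: one row of n_embd floats per token; seq_id[t] in [0, n_seq)
std::vector<std::vector<float>> mean_pool(
        const std::vector<std::vector<float>> & embd,
        const std::vector<int> & seq_id,
        int n_seq) {
    const size_t n_embd = embd.empty() ? 0 : embd[0].size();
    std::vector<std::vector<float>> pooled(n_seq, std::vector<float>(n_embd, 0.0f));
    std::vector<int> count(n_seq, 0);
    for (size_t t = 0; t < embd.size(); ++t) {
        const int s = seq_id[t];
        count[s]++;
        for (size_t i = 0; i < n_embd; ++i) {
            pooled[s][i] += embd[t][i]; // accumulate per sequence
        }
    }
    for (int s = 0; s < n_seq; ++s) {
        for (size_t i = 0; count[s] > 0 && i < n_embd; ++i) {
            pooled[s][i] /= count[s];   // divide by sequence length
        }
    }
    return pooled;
}
```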

Should I push this to this PR or wait until this goes through and start a new one?

Review thread on llama.cpp, lines 7270 to 7489 (outdated):
```diff
-// the output is always the last tensor in the graph
-struct ggml_tensor * res = gf->nodes[gf->n_nodes - 1];
-GGML_ASSERT(strcmp(res->name, "result_output") == 0);
+// get logits and embeddings
+struct ggml_tensor * res = ggml_graph_get_tensor(gf, "result_output");
+struct ggml_tensor * embeddings = ggml_graph_get_tensor(gf, "result_norm");
```
@ggerganov (Owner) commented:

Using ggml_graph_get_tensor is not recommended here because it does a strcmp against every node in the graph, which can become noticeable in terms of speed. For now, we should be "poking" at the last few tensors to find what we need: not great, but it will improve in the future.
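A sketch of that "poking" approach, assuming only the ggml_cgraph fields already visible in the diff above (nodes, n_nodes, name); the helper name is hypothetical:

```cpp
#include <cstring>
#include "ggml.h"

// Search the last `window` graph nodes for a tensor by name, avoiding a
// strcmp over the entire graph.
static struct ggml_tensor * graph_find_recent(
        struct ggml_cgraph * gf, const char * name, int window) {
    for (int i = gf->n_nodes - 1; i >= 0 && i >= gf->n_nodes - window; --i) {
        if (strcmp(gf->nodes[i]->name, name) == 0) {
            return gf->nodes[i];
        }
    }
    return nullptr; // not among the most recent nodes
}
```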

@ggerganov (Owner) left a comment:

Let's fix the ggml_graph_get_tensor comment and merge. After that, we can look into batching support in a separate PR.

@iamlemec merged commit 2891c8a into ggerganov:master on Feb 11, 2024 (46 of 54 checks passed)
@astrowonk commented Feb 27, 2024

This is very exciting. I was able to convert all-MiniLM-L6-v2, but alas not sentence-transformers/all-mpnet-base-v2, which is what I had been using with sentence-transformers lately.

I get a NotImplementedError: Architecture "MPNetForMaskedLM" not supported!

I'll try all-MiniLM-L6-v2 for now and maybe nomic, which I have tried previously, but if we could add support to convert all-mpnet-base-v2 that would be amazing. (I can open another issue if that's helpful.)

EDIT: I was able to convert all-MiniLM-L6-v2 with a fork of bert.cpp but the resulting gguf model doesn't load in llama.cpp.

EDIT OF THE EDIT: Sorry, I meant that I converted all-mpnet-base-v2 with the bert.cpp fork, but yeah, it's not working, as others have explained subsequently.

@ditsuke (Contributor) commented Feb 27, 2024

> EDIT: I was able to convert all-MiniLM-L6-v2 with a fork of bert.cpp but the resulting gguf model doesn't load in llama.cpp.

You have to use the hf-to-gguf script that ships with this repo; bert.cpp's conversion script doesn't produce a llama.cpp-compatible version.

@astrowonk commented Feb 27, 2024

>> EDIT: I was able to convert all-MiniLM-L6-v2 with a fork of bert.cpp but the resulting gguf model doesn't load in llama.cpp.

> You have to use the hf-to-gguf script that ships with this repo; bert.cpp's conversion script doesn't produce a llama.cpp-compatible version.

Yes, that's what I'm asking for: for the hf-to-gguf script here to support converting the all-mpnet-base-v2 model. It crashes at the moment with the NotImplementedError I describe in my earlier post.

@ditsuke (Contributor) commented Feb 27, 2024

> Yes, that's what I'm asking for: for the hf-to-gguf script here to support converting the all-mpnet-base-v2 model. It crashes at the moment with the NotImplementedError I describe in my earlier post.

Oh, I see. I think your last statement was meant for all-mpnet then; you should open a new issue.

@iamlemec (Collaborator, Author) commented:

It looks like all-mpnet has T5-style relative position embeddings. I don't think those are supported here yet.
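For context, "T5-style" here refers to bucketed relative position biases. The sketch below follows the transformers reference implementation of T5's bucketing (bidirectional case); it is not code from this PR or from MPNet itself, just an illustration of the mechanism that would need supporting:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>

// Map a (query - key) offset to one of num_buckets learned bias slots:
// small offsets get exact buckets, larger ones are binned logarithmically.
int relative_position_bucket(int relative_position, int num_buckets = 32, int max_distance = 128) {
    int bucket = 0;
    num_buckets /= 2;                       // half the buckets per direction
    if (relative_position > 0) {
        bucket += num_buckets;
    }
    const int n = std::abs(relative_position);
    const int max_exact = num_buckets / 2;
    if (n < max_exact) {
        return bucket + n;                  // exact bucket for small offsets
    }
    int large = max_exact + (int)(std::log((float) n / max_exact) /
                                  std::log((float) max_distance / max_exact) *
                                  (num_buckets - max_exact));
    return bucket + std::min(large, num_buckets - 1);
}
```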

@mofanke commented Mar 10, 2024

I tried BAAI/bge-m3, but it does not work right now, because the model architecture is XLMRobertaModel, not BERT, and its "tokenizer_class" is "XLMRobertaTokenizer".

@cebtenzzre (Collaborator) commented:

> I tried BAAI/bge-m3, but it does not work right now, because the model architecture is XLMRobertaModel, not BERT, and its "tokenizer_class" is "XLMRobertaTokenizer".

You could open a feature request if you haven't already.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen

---------

Co-authored-by: Jared Van Bortel <[email protected]>
Co-authored-by: Jared Van Bortel <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
@mofanke commented Mar 14, 2024

#6007 already done

@hiepxanh commented:

> #6007 already done

What do you mean? I think the PR doesn't have support yet.

I tried converting today and saw this:

(llama.cpp) E:\pre-built\llama.cpp>python convert-hf-to-gguf.py models/multilingual-e5-large/
Loading model: multilingual-e5-large
Traceback (most recent call last):
  File "E:\pre-built\llama.cpp\convert-hf-to-gguf.py", line 2073, in <module>
    main()
  File "E:\pre-built\llama.cpp\convert-hf-to-gguf.py", line 2053, in main
    model_class = Model.from_model_architecture(hparams["architectures"][0])
  File "E:\pre-built\llama.cpp\convert-hf-to-gguf.py", line 204, in from_model_architecture
    raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'XLMRobertaModel' not supported!

@mofanke commented Mar 18, 2024

>> #6007 already done

> What do you mean? I think the PR doesn't have support yet. I tried converting today and saw this: NotImplementedError: Architecture 'XLMRobertaModel' not supported!

Aha, I'm sorry for causing you confusion; I just meant that I opened a feature request.

howlger added a commit to howlger/djl that referenced this pull request Apr 1, 2024
In order to get support for BERT-based sentence embedding models like BAAI/bge-base-en-v1.5, mixedbread-ai/mxbai-embed-large-v1, or others, update llama.cpp from

b1696 (2023-12-12):
https://github.com/ggerganov/llama.cpp/releases/tag/b1696

to the current latest release

b2581 (2024-03-30):
https://github.com/ggerganov/llama.cpp/releases/tag/b2581

BERT support was added to llama.cpp in February 2024:
ggerganov/llama.cpp#5423
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024 (same commit message as above)
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
@jkgenser commented Apr 2, 2024

>> Just tried google bert uncased and it raised NotImplementedError: Architecture "BertForMaskedLM" not supported! I'm probably missing something here. Another model raised NotImplementedError: Architecture "BertForSequenceClassification" not supported!

> Those are BERT models that have been pretrained to create embeddings for individual words, but this PR is for BERT models that have been trained to generate embeddings for entire sentences and paragraphs; masked-LM checkpoints will not produce good results here.
>
> The keyword you are looking for is "SBert", or Sentence Transformers in general (paper, website, HF).
>
> nomic-embed-text-v1 is a good model to start with. Disclosure: I work for Nomic.
>
> bge-base-en-v1.5 is another BERT of similar size.

So if I fine-tune a BERT model for a classification task, it would not work to convert it to GGML? I've been watching this work and am really excited to be able to deploy my fine-tuned BERT models on llama.cpp.

@beyondskyway commented:

> I tried converting today and saw this: NotImplementedError: Architecture 'XLMRobertaModel' not supported!

Same when converting https://huggingface.co/maidalun1020/bce-embedding-base_v1/tree/main

@ggerganov (Owner) commented:

Where is the reference implementation of XLMRobertaModel for models such as https://huggingface.co/intfloat/multilingual-e5-base/tree/main? I would like to add support for these.

@mofosyne added the labels "enhancement (New feature or request)", "model (Model specific)", and "Review Complexity: High (generally requires in-depth knowledge of LLMs or GPUs)" on May 13, 2024
@iamlemec (Collaborator, Author) commented:

I think the original is in fairseq: https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/roberta. There's also an implementation in transformers: https://github.com/huggingface/transformers/tree/main/src/transformers/models/xlm_roberta. I've actually been looking into XLMRobertaModel to run BAAI/bge-m3, and comparing the transformers implementations, I think the model side is identical to BERT.

But there are differences in the tokenization that have driven me slightly mad trying to understand them. The model file is called sentencepiece.bpe.model, but it appears to be an actual SentencePiece (unigram) style model, not BPE. Even then, the way it handles spaces with metaspaces looks to be a little different from how SPM works in the current llama.cpp implementation.
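To make the metaspace point concrete, here is a minimal sketch of SentencePiece-style metaspace pretokenization: spaces become U+2581 ("▁") so that word boundaries survive inside vocab pieces. Whether a leading "▁" is always prepended is exactly the kind of detail that can differ between implementations; the version below assumes it is:

```cpp
#include <string>

// Replace spaces with the metaspace character and prepend one,
// assuming add_prefix_space behavior.
std::string metaspace(const std::string & text) {
    static const std::string META = "\xe2\x96\x81"; // UTF-8 for U+2581
    std::string out = META;
    for (char c : text) {
        out += (c == ' ') ? META : std::string(1, c);
    }
    return out;
}
```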

@ggerganov (Owner) commented:

Thanks!

> Even then, the way it handles spaces with metaspaces looks to be a little different from how SPM works in the current llama.cpp implementation.

Maybe the "clean_up_tokenization_spaces" parameter is controlling this behaviour?

https://huggingface.co/BAAI/bge-m3/blob/main/tokenizer_config.json#L3

@sragrawal commented:

Hi all, is there any plan to support XLMRobertaModel? https://huggingface.co/intfloat/multilingual-e5-small works very well for multilingual embeddings at its size (https://huggingface.co/spaces/mteb/leaderboard). Please let me know if I should open a new issue for this.

@iamlemec (Collaborator, Author) commented:

@sragrawal I believe that Unigram support from #8089 will get us most of the way there on the XLMRoberta tokenizer (which is used by this model and others such as BAAI/bge-m3). The main remaining piece is loading and using the trie structure stored in precompiled_charsmap. There may be some additional pretokenization work, but that should be easier to handle. (A sketch of the core unigram segmentation step follows.)
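For reference, the core of unigram tokenization is a Viterbi search over piece log-probabilities. The sketch below is illustrative only; the vocab map is an assumption rather than llama.cpp's or #8089's actual data structures, and a real implementation also needs the precompiled_charsmap normalization and a byte/UNK fallback:

```cpp
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// best[i] = log-prob of the best segmentation of the first i bytes.
std::vector<std::string> unigram_segment(
        const std::string & text,
        const std::unordered_map<std::string, float> & logp) { // piece -> log prob
    const int n = (int) text.size();
    const float NEG_INF = -std::numeric_limits<float>::infinity();
    std::vector<float> best(n + 1, NEG_INF);
    std::vector<int>   back(n + 1, -1); // start index of the last piece
    best[0] = 0.0f;
    for (int end = 1; end <= n; ++end) {
        for (int start = 0; start < end; ++start) {
            if (best[start] == NEG_INF) continue;
            auto it = logp.find(text.substr(start, end - start));
            if (it != logp.end() && best[start] + it->second > best[end]) {
                best[end] = best[start] + it->second;
                back[end] = start;
            }
        }
    }
    std::vector<std::string> pieces;
    if (n == 0 || back[n] < 0) {
        return pieces; // no full segmentation: real code falls back to UNK/bytes
    }
    for (int end = n; end > 0; end = back[end]) {
        pieces.insert(pieces.begin(), text.substr(back[end], end - back[end]));
    }
    return pieces;
}
```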
