
llama : add BERT support #2872

Closed
ggerganov opened this issue Aug 29, 2023 · 42 comments
Labels
model Model specific

Comments

@ggerganov
Owner

There is a working bert.cpp implementation.
We should try to implement this in llama.cpp and update the embedding example to use it.

The implementation should follow mostly what we did to integrate Falcon.
Here are the main steps:

  • Update gguf.py with BERT arch KV pairs and tensors
  • Python convert script using gguf.py to generate F16 model
  • add tokenizer implementation in llama.cpp
  • add function to build BERT graph
  • add any new ops in ggml if needed
  • add CUDA offloading
  • add tokenizer tests
@dranger003
Contributor

@monatis Happy to help with testing or any other tasks you may need. Just keep in mind that my understanding of LLM architecture is quite limited.

@m0saan

m0saan commented Sep 1, 2023

@ggerganov Is this claimed by somebody? Otherwise I can give it a shot!

@ggerganov
Owner Author

I think @monatis is already working on it

@monatis
Collaborator

monatis commented Sep 3, 2023

Yes, I'm on it; I'll raise a PR early this week.

@nortekax

nortekax commented Oct 10, 2023

@monatis did this ever happen? I did not see a PR. RAG (retrieval-augmented generation) is a major application for which BERT models are heavily used, so it would be good to add BERT model support.

@ggerganov
Owner Author

Field report: https://bloop.ai/blog/gpu_with_ggml

@monatis
Collaborator

monatis commented Oct 16, 2023

Ouch, after that I was diving deep into LLaVA. Now that it's shipped, I can go back to this one.

@yunghoy

yunghoy commented Oct 24, 2023

Maybe m0saan can take care of this?

@CyrilPeponnet

While we're at it, it might be worth adding https://jina.ai/news/jina-ai-launches-worlds-first-open-source-8k-text-embedding-rivaling-openai/ :p It doesn't seem to be a regular BERT model though, see https://huggingface.co/jinaai/jina-bert-implementation/tree/main

@xbotter

xbotter commented Nov 12, 2023

Is there any progress?

@m0saan

m0saan commented Nov 12, 2023

Hello @yunghoy, is someone working on this? Otherwise I'll work on it! Thanks

@nortekax

@m0saan, I don't think anyone is working on it; it was originally assigned to monatis, but that never happened, so in my humble opinion it would be great if you worked on it.

@wsxiaoys
Contributor

wsxiaoys commented Nov 22, 2023

Hi @m0saan, could you please share how the implementation is going? I'm also interested in this topic and am willing to help with testing or implementation.

@snowyu
Contributor

snowyu commented Nov 28, 2023

Reference: HF's tokenizers C++ wrapper, tokenizers-cpp.

@obezzad

obezzad commented Dec 3, 2023

I can take care of the integration if no one is working on it.

@dranger003
Contributor

https://github.com/FFengIll/embedding.cpp

@sandangel

@obezzad , thank you so much. That would be awesome. I can help with testing when a PR is ready.

@Lurrobert

Lurrobert commented Dec 19, 2023

@obezzad, @monatis any updates?

@xyzhang626

xyzhang626 commented Dec 20, 2023

Hey guys, I made some progress based on a fork of bert.cpp; please check it out if you are still interested. It supports the SOTA BGE series models, real batch inference, a multilingual tokenizer, and has no 3rd-party dependencies: embeddings.cpp

@snowyu
Contributor

snowyu commented Dec 20, 2023

@xyzhang626 all tokenizer tests failed after using models/bge-small-zh-v1.5/ggml-model-q4_0.bin:

// examples/test_tokenizer.cpp
int main(int argc, char ** argv) {

    bert_params params;
    params.model = "models/bge-base-zh-v1.5/ggml-model-q4_0.bin";
    ...

@xyzhang626

hey @snowyu it's because that test was written for all-MiniLM-L6-v2 instead of bge-small-zh-v1.5. It actually works well, and I've pushed a script for a better test; refer to these instructions.

To avoid polluting this thread, we can discuss the details further in this post.

@snowyu
Contributor

snowyu commented Dec 28, 2023

The Pre-Tokenizer component in Hugging Face Transformers is highly versatile, offering various types that can be dynamically applied to text before it's tokenized by the main Tokenizer class.

Currently, bert.cpp (and embeddings.cpp) implements only one type of pre-tokenization, WordPiece, and uses it exclusively for all models based on the BERT architecture. However, relying solely on WordPiece may not be sufficient in scenarios where other types of pre-tokenization are required to handle specific text patterns more effectively.

To achieve a pure C++ implementation that fully covers the capabilities of Hugging Face's Tokenizers library and its tokenizer_config.json file, we would need to implement all the available pre-tokenizers in addition to reading this configuration file correctly. This would allow llama.cpp (and similar tools) to apply the different pre-processing steps specified in each model checkpoint directory.

For example, consider the tokenizer.json from the pre-trained paraphrase-multilingual-MiniLM model:

{
  ...,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      { "type": "WhitespaceSplit" },
      { "type": "Metaspace", "replacement": "▁", ... }
    ]
  }
}

This configuration specifies a sequence of pre-tokenization steps that should be applied to the text before it is tokenized. Implementing such configurations in C++ would require parsing this JSON structure and applying each step accordingly, ensuring compatibility with the Hugging Face Tokenizers library while maintaining performance within llama.cpp.

For a clear single-file example of how these pre-tokenization steps can be implemented in JavaScript, see the source of transformers.js's tokenizer implementation: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js
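
To make the idea concrete, here is a minimal C++ sketch of reading the pre_tokenizer section of a tokenizer.json and dispatching on the listed types. It assumes nlohmann::json is available (llama.cpp's tooling already uses it in places); the enum and function names are illustrative, not existing bert.cpp or llama.cpp APIs:

// Sketch only: parse the pre_tokenizer section of a HF tokenizer.json.
// pretok_type and load_pre_tokenizers are hypothetical names.
#include <fstream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

enum class pretok_type { whitespace_split, metaspace, unknown };

static pretok_type parse_pretok_type(const std::string & s) {
    if (s == "WhitespaceSplit") return pretok_type::whitespace_split;
    if (s == "Metaspace")       return pretok_type::metaspace;
    return pretok_type::unknown;
}

static std::vector<pretok_type> load_pre_tokenizers(const std::string & path) {
    std::ifstream f(path);
    nlohmann::json j;
    f >> j;

    std::vector<pretok_type> steps;
    if (!j.contains("pre_tokenizer")) {
        return steps; // no pre-tokenizer configured
    }

    const auto & pt = j["pre_tokenizer"];
    if (pt["type"] == "Sequence") {
        // a sequence of pre-tokenizers applied in order
        for (const auto & p : pt["pretokenizers"]) {
            steps.push_back(parse_pretok_type(p["type"].get<std::string>()));
        }
    } else {
        steps.push_back(parse_pretok_type(pt["type"].get<std::string>()));
    }
    return steps;
}

Each returned step would then map to a concrete text transformation (whitespace splitting, Metaspace replacement, and so on) applied before WordPiece or whichever main tokenizer the model uses.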

snowyu added a commit to snowyu/llama.cpp that referenced this issue Jan 26, 2024
The content of the OBJ type is actually a list of all key names of the object. This change includes several improvements and additions to the codebase:

* GGUFWriter:
  * Added `def add_kv(self, key: str, val: Any) -> None` method: Automatically determines the appropriate value type based on val.
  * Added `def add_dict(self, key: str, val: dict) -> None` method: add object(dict) key-value
* constants:
  * Revised `GGUFValueType.get_type(val)`: Added support for numpy's integers and floating point numbers, and appropriately selected the number of digits according to the size of the integer.
* gguf_reader
  * Added `ReaderField.get()` method: get the value of this ReaderField
* Unit tests have been added to cover these changes.

Related Issues: ggerganov#4868, ggerganov#2872
snowyu added a commit to snowyu/llama.cpp that referenced this issue Jan 26, 2024
The content of the OBJ type is actually a list of all key names of the object.

* GGUFWriter:
  * add `def add_kv(self, key: str, val: Any) -> None`:  This will be added based on the val type
  * add `def add_dict(self, key: str, val: dict) -> None`: add object(dict) value
* constants:
  * `GGUFValueType.get_type`: Added support for Numpy's integers and floating-point numbers, and selected the appropriate number of digits based on the size of the integer.
* gguf_reader:
  * add `ReaderField.get`: to return the value of the field
* Unit test added.

Related Issues: ggerganov#4868, ggerganov#2872
@iamlemec
Collaborator

iamlemec commented Feb 2, 2024

Hi folks! I went ahead and updated the existing BERT implementations to work with GGML master. See here: https://github.com/iamlemec/bert.cpp/. This is forked from @xyzhang626's embeddings.cpp and consequently @skeskinen's bert.cpp. Mostly graph- and allocation-related changes, and the recent 4d copy commits ended up being necessary for it to work.

I've tested CPU and CUDA with various quantization levels, but I don't have access to Metal. Preliminary benchmarks on bge class models show a roughly 3x speedup relative to ONNX on CPU, while on CUDA ONNX is actually about 2x faster. But maybe flash attention will change that?

I don't want to be too presumptuous, as I'm not the original author, but it would seem to me that including this in ggml would increase its visibility and improve the chances that it stays up to date with the evolution of the ggml interface.

@sroussey
Contributor

sroussey commented Feb 3, 2024

Nice!

You might add to the readme that you need to pip install gguf and a few other packages that I happened to have installed, or create a requirements file.

@sroussey
Contributor

sroussey commented Feb 3, 2024

For fun and speed, I tried adding to CMakeLists.txt:
option(GGML_METAL "ggml: use Metal" ON)

But

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: error: Error Domain=NSCocoaErrorDomain Code=260 "The file “ggml-metal.metal” couldn’t be opened because there is no such file." UserInfo={NSFilePath=ggml-metal.metal, NSUnderlyingError=0x12ebbb020 {Error Domain=NSPOSIXErrorDomain Code=2 "No such file or directory"}}
bert_load_from_file: ggml_backend_metal_init() failed
bert_load_from_file: using CPU backend

So I tried like this:

❯ GGML_METAL_PATH_RESOURCES=ggml/src/ build/bin/main -m models/bge-base-en-v1.5/ggml-model-f16.gguf -p "Hello world"                                                                
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = ggml/src/
ggml_metal_init: loading 'ggml/src/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   209.09 MiB, (  210.72 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (  210.73 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  1109.16 MiB, ( 1319.88 / 49152.00)
101 -> [CLS]
7592 -> hello
2088 -> world
102 -> [SEP]

ggml_metal_graph_compute_block_invoke: error: unsupported op 'REPEAT'
GGML_ASSERT: /Users/steve/Code/bert.cpp/ggml/src/ggml-metal.m:760: !"unsupported op"
[1]    28616 abort      GGML_METAL_PATH_RESOURCES=ggml/src/ build/bin/main -m  -p "Hello world"

I'm looking forward to this work getting integrated into a PR for llama.cpp so that embeddings (a real practical use case, and ideal for client-side indexing) get more attention!

@iamlemec
Collaborator

iamlemec commented Feb 3, 2024

@sroussey Thanks for the feedback! Just added in a requirements.txt and updated the CMakeLists.txt file to copy over the ggml-metal.metal file on configuration. I think that should do the trick.

Agreed on the last point, and in that case we could use the more mature tokenization infrastructure from llama.cpp as well.

@sroussey
Contributor

sroussey commented Feb 3, 2024

@iamlemec Some other things

  1. How do you get the embeddings from main? I want to match the numbers against the same model running elsewhere, so it should output all the dimensions of the result.
  2. Why does it create 32 copies of the tokens to build a batch? I took this out and of course it's faster. ;)

@slaren
Collaborator

slaren commented Feb 3, 2024

ggml_metal_graph_compute_block_invoke: error: unsupported op 'REPEAT'

Regarding this error, the CPU, CUDA and Metal backends support full broadcasting in the ggml_add operation, so it should be possible to remove the ggml_repeat operation entirely. Looking at the code, it may require reshaping some tensors, but I think it is doable.
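
To illustrate, a bias or mask addition written with an explicit repeat can usually be collapsed into a single broadcasting ggml_add. A minimal sketch, assuming the usual [n_embd, n_tokens] activation and [n_embd] bias shapes (tensor and function names are illustrative):

#include "ggml.h"

// Hypothetical fragment of a graph-build function.
static struct ggml_tensor * add_bias(struct ggml_context * ctx0,
                                     struct ggml_tensor  * cur,    // [n_embd, n_tokens]
                                     struct ggml_tensor  * bias) { // [n_embd]
    // before: cur = ggml_add(ctx0, cur, ggml_repeat(ctx0, bias, cur));
    // after:  ggml_add broadcasts 'bias' over 'cur', so the explicit
    //         ggml_repeat (the op Metal rejected above) is unnecessary
    return ggml_add(ctx0, cur, bias);
}

The same applies to adding the attention mask to KQ, provided the mask is shaped so that the broadcast is well-defined.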

@iamlemec
Collaborator

iamlemec commented Feb 3, 2024

@sroussey For any testing and benchmarking, I usually just use the Python interface, which should have ~zero overhead. But yeah, the 32 was to test batching, and I was just printing the first 8 to make sure they didn't differ from the baseline. Now I've changed main so it just does an n=1 batch and outputs the full embedding vector.

@slaren Yeah, seems like it. I guess I could keep the attention mask as a 4d tensor rather than combining the head and batch dimensions as it does currently. Then adding it to the KQ matrix will broadcast implicitly.

@iamlemec
Collaborator

iamlemec commented Feb 3, 2024

@sroussey Ok, I believe ggml_repeat has been excised. And actually it obviated a bunch of other reshape ops. Might work on Metal now!

@sroussey
Contributor

sroussey commented Feb 3, 2024

Metal works now! Embedding two words is actually slower, but I'm not that surprised. I'll need to do some testing later with real data.

BTW: I made a PR to fix a segfault, and another to have all the messages go to stderr except the array itself so it can be piped around.

@ggerganov
Owner Author

ggerganov commented Feb 3, 2024

You can replace these with ggml_soft_max_ext():

// before
KQ = ggml_scale_inplace(ctx0, KQ, 1.0f / sqrt((float)d_head));
KQ = ggml_add(ctx0, KQ, attn_mask);
KQ = ggml_soft_max(ctx0, KQ);

// after
KQ = ggml_soft_max_ext(ctx0, KQ, attn_mask, 1.0f / sqrt((float)d_head));

Edit: nvm, it's okay the way it is (iamlemec/bert.cpp#3)

@cebtenzzre
Collaborator

Which repo do we plan to eventually integrate BERT into - is it still llama.cpp? I would like to port Nomic Embed to GGML so GPT4All can use it locally (its architecture is very similar to BERT), and also drop our non-GPU-accelerated, non-batching variant of bert.cpp in the process. But I'm not sure how best to contribute.

Also @iamlemec - could you shed light on why this normalization of the output was added?

    inpL = ggml_rms_norm(ctx0, inpL, layer_norm_eps); // [E, B]
    inpL = ggml_scale_inplace(ctx0, inpL, 1.0f / sqrt((float)n_embd)); // [E, B] (since rms_norm does mean instead of sum)

cc @ggerganov

@ggerganov
Owner Author

I would like at some point to have embedding models supported in llama.cpp. Not sure if the existing llama_get_embeddings API would be enough though - any thoughts on this?

But I'm not sure how best to contribute.

Let me know about any specific roadblocks and we can discuss. From a brief look at bert.cpp, I think it is entirely possible to add BERT support in llama.cpp.

@iamlemec
Collaborator

iamlemec commented Feb 7, 2024

@cebtenzzre That final normalization is just the usual L2-normalization. Looking into it, I realized that most implementations of BERT (like HF) don't normalize in the core model and only do it later at a higher level. So I just added a normalize flag to the embed interface that gets passed to the graph construction code.
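
For reference, here is why the rms_norm plus 1/sqrt(n_embd) scaling in the quoted snippet amounts to ordinary L2 normalization (ignoring the epsilon term), which is what the "mean instead of sum" comment alludes to, with n = n_embd:

\mathrm{rms\_norm}(x)_i \cdot \frac{1}{\sqrt{n}}
  = \frac{x_i}{\sqrt{\tfrac{1}{n}\sum_{j=1}^{n} x_j^{2}}} \cdot \frac{1}{\sqrt{n}}
  = \frac{x_i}{\sqrt{\sum_{j=1}^{n} x_j^{2}}}
  = \frac{x_i}{\lVert x \rVert_2}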

@ggerganov I could have a PR for you later today. Were you thinking of putting it in examples à la LLaVA, or more integrated into the core? As an example, I basically have it working right now. For the more integrated approach, it seems like llama_get_embeddings would work, and I could just add a build_bert function and work out any conversion issues.

@cebtenzzre
Collaborator

@iamlemec Here is my rough initial attempt at integrating BERT into core llama.cpp, which I started before I found your fork. Hopefully you find some part of it useful; I have not spent much time comparing it to your implementation, but I know that it is missing some of your improvements: https://github.com/ggerganov/llama.cpp/tree/ceb/bert

@iamlemec
Collaborator

iamlemec commented Feb 7, 2024

@cebtenzzre Thanks! This is really helpful. Seems like I can build off of this and just add in my changes to build_bert.

I think the biggest difference when going to BERT-style models is that the attention is not causal. Right now it looks like causal attention is hard-coded in llama_build_graph. I'll add a bool causal_attn flag to control this that defaults to true. Does it make more sense for this to go in llama_cparams or llama_hparams?

@cebtenzzre
Collaborator

Does it make more sense for this to go in llama_cparams or llama_hparams?

It should be in llama_hparams. llama_cparams is for things the user can change via llama_new_context_with_model.
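
A minimal sketch of where the flag could live; apart from causal_attn, the fields shown are placeholders (the real llama_hparams has many more members):

#include <cstdint>

struct llama_hparams {
    uint32_t n_embd      = 0;
    uint32_t n_head      = 0;
    // BERT-style encoder models would set this to false, e.g. from a GGUF
    // key written by the conversion script; decoder-only LLMs keep true
    bool     causal_attn = true;
};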

@ggerganov
Owner Author

Were you thinking of putting it in examples a la LLaVa or more integrated into the core?

It should be integrated in the core library and have support for converting BERT models to GGUF format and loading them as any other model. There could be an accompanying example to demonstrate basic usage

I think the biggest difference going to BERT style models is that the attention is not causal. Right now it looks like causal attention is hard-coded in llama_build_graph

You can change the KQ_mask to non-causal if the model arch is BERT. Currently, it always constructs a causal mask:

llama.cpp (lines 7047 to 7070 at 8504d2d):

{
    const int64_t n_kv     = llm.n_kv;
    const int64_t n_tokens = batch.n_tokens;

    GGML_ASSERT(ggml_backend_buffer_is_host(lctx.inp_KQ_mask->buffer));

    float * data = (float *) lctx.inp_KQ_mask->data;

    for (int h = 0; h < 1; ++h) {
        for (int j = 0; j < n_tokens; ++j) {
            const llama_pos    pos    = batch.pos[j];
            const llama_seq_id seq_id = batch.seq_id[j][0];

            for (int i = 0; i < n_kv; ++i) {
                float f;
                if (!lctx.kv_self.cells[i].has_seq_id(seq_id) || lctx.kv_self.cells[i].pos > pos) {
                    f = -INFINITY;
                } else {
                    f = 0;
                }
                data[h*(n_kv*n_tokens) + j*n_kv + i] = f;
            }
        }
    }
}

But we can add a check for whether it is an embedding model and construct the appropriate mask. This way you wouldn't need an extra flag, and if other embedding models are added in the future, they can reuse that mask. Would that work?
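
For illustration, a hedged sketch of what the non-causal variant of the quoted loop could look like for a BERT-style embedding model: keep the sequence-membership check but drop the position comparison, so every token attends to every token in its own sequence (variable names follow the snippet above; this is not the final implementation):

for (int h = 0; h < 1; ++h) {
    for (int j = 0; j < n_tokens; ++j) {
        const llama_seq_id seq_id = batch.seq_id[j][0];

        for (int i = 0; i < n_kv; ++i) {
            // non-causal: mask only tokens from other sequences,
            // without the pos comparison used for causal attention
            const float f = lctx.kv_self.cells[i].has_seq_id(seq_id) ? 0.0f : -INFINITY;
            data[h*(n_kv*n_tokens) + j*n_kv + i] = f;
        }
    }
}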

@iamlemec
Collaborator

iamlemec commented Feb 8, 2024

Ok, I think I have it working! Current status is at: https://github.com/iamlemec/llama.cpp. Happy to make it a PR if you think it's ready.

I ended up adding a WordPiece tokenizer called llm_tokenizer_wpm. This should be similar to llm_tokenizer_spm, but I was having issues with spm not choosing the longest matching token. It seems possible there's some way to unify these, but I couldn't quite figure it out.
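
For context, the core of a WordPiece tokenizer is a greedy longest-prefix match against the vocabulary, with continuation pieces carrying a "##" prefix. A simplified standalone sketch, not the actual llm_tokenizer_wpm internals (the vocab map and unk_id are illustrative, and a real implementation iterates over Unicode characters rather than raw bytes):

#include <string>
#include <unordered_map>
#include <vector>

// Greedy longest-match WordPiece over a single pre-tokenized word.
// Returns token ids, or {unk_id} if no prefix of the word is in the vocab.
static std::vector<int> wordpiece_word(const std::string & word,
                                       const std::unordered_map<std::string, int> & vocab,
                                       int unk_id) {
    std::vector<int> out;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        int    id  = -1;
        // try the longest possible piece first, then shrink from the right
        while (end > start) {
            std::string piece = word.substr(start, end - start);
            if (start > 0) {
                piece = "##" + piece; // continuation pieces carry the ## prefix
            }
            auto it = vocab.find(piece);
            if (it != vocab.end()) { id = it->second; break; }
            --end;
        }
        if (id == -1) {
            return { unk_id }; // no prefix matched: the whole word becomes [UNK]
        }
        out.push_back(id);
        start = end;
    }
    return out;
}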

As you suggested, I added a causal_attn bool to hparams and a corresponding key bert.attention.causal to the GGUF converter. These are then used around the code you quoted above.

I was a little unsure about how to get the output correctly. Right now, it looks at the last node, and if it's called "result_embed" it treats it as a pooled embedding. Otherwise, it just executes the old logic (which might include getting a last token embedding from a proper LLM).

Also, the pooling layer is a bit tricky. Right now it's averaging over the entire batch. But if you had multiple sequences, it would average over all of them, so essentially it only works properly for single-sequence batches. Is there some way to sum within each sequence? I was thinking you could preconstruct a matrix of size [n_seqs, n_tokens] and matmul with that.
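
One way to realize that idea: precompute a pooling matrix on the host with one row per sequence, holding 1/len(seq) at that sequence's token positions and 0 elsewhere; multiplying it with the [n_embd, n_tokens] embedding tensor (via ggml_mul_mat, with a transpose as required by ggml's layout) then yields one mean-pooled vector per sequence. A sketch of the host-side fill, with illustrative names:

#include <vector>

// pool_data holds n_seqs * n_tokens floats (row s, column t);
// seq_of_token[t] is the sequence id of token t within the batch.
static void fill_mean_pool_matrix(float * pool_data,
                                  const std::vector<int> & seq_of_token,
                                  int n_seqs, int n_tokens) {
    std::vector<int> seq_len(n_seqs, 0);
    for (int t = 0; t < n_tokens; ++t) {
        seq_len[seq_of_token[t]]++;
    }
    for (int s = 0; s < n_seqs; ++s) {
        for (int t = 0; t < n_tokens; ++t) {
            // 1/len for this sequence's tokens, 0 for tokens of other sequences
            pool_data[s*n_tokens + t] = (seq_of_token[t] == s) ? 1.0f/seq_len[s] : 0.0f;
        }
    }
}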

@slaren
Collaborator

slaren commented Feb 8, 2024

It would be good to open a PR even if it is still a draft, so that we can see the overall changes and give more specific feedback.
