llama : add BERT support #2872
@monatis Happy to help with testing or any other tasks you may need. Just keep in mind that my understanding of LLM architecture is quite limited. |
@ggerganov Is this claimed by somebody? Otherwise I can give this a shot! |
I think @monatis is already working on it |
Yes, I'm working on it. I'll raise a PR early this week. |
@monatis did this ever happen? I did not see a PR. RAG (retrieval-augmented generation) is a major application for which BERT models are heavily used, so it would be good to add BERT model support. |
Field report: https://bloop.ai/blog/gpu_with_ggml |
Ouch, after that I was diving deep into LLaVA. Now that it's shipped, I can go back to this one. |
Maybe m0saan can take care of this? |
While on it, it may be worth adding https://jina.ai/news/jina-ai-launches-worlds-first-open-source-8k-text-embedding-rivaling-openai/ :p It doesn't seem to be a regular BERT model though, see https://huggingface.co/jinaai/jina-bert-implementation/tree/main |
Is there any progress? |
Hello @yunghoy, is someone working on this? Otherwise I'll work on it! Thankies |
@m0saan, I don't think anyone is working on it; originally it was assigned to monatis, but it never happened, so in my humble opinion it would be great if you worked on it. |
Hi @m0saan, could you please share the implementation process? I'm also interested in the topic and am willing to help with testing or implementing |
References: HF's tokenizers wrapper, tokenizers-cpp. |
I can take care of the integration if no one is working on it. |
@obezzad, thank you so much. That would be awesome. I can help with testing when a PR is ready. |
Hey guys, I made some progress based on the fork of bert.cpp, please check it out if you are still interested. It supports the SOTA BGE series models, real batch inference, a multilingual tokenizer, and has no 3rd-party dependencies: embeddings.cpp |
@xyzhang626 all tokenizer tests failed after using:

// examples/test_tokenizer.cpp
int main(int argc, char ** argv) {
bert_params params;
params.model = "models/bge-base-zh-v1.5/ggml-model-q4_0.bin";
... |
hey @snowyu it's because that test is written for a different model. To not pollute this thread, we can discuss more details in that post. |
The Pre-Tokenizer component in Hugging Face Transformers is highly versatile, offering various types that can be dynamically applied to text before it's tokenized by the main Tokenizer class. To achieve a pure C++ implementation that fully leverages the capabilities provided by Hugging Face's Tokenizers library and its tokenizer configuration file, we would need to implement all available pre-tokenizers in addition to reading this configuration file correctly. This would allow us to apply the different pre-processing steps specified for each model checkpoint directory within llama.cpp or other similar tools. For example, consider the following configuration:

{
    ...,
    "pre_tokenizer": {
        "type": "Sequence",
        "pretokenizers": [
            {"type": "WhitespaceSplit"},
            {"type": "Metaspace", "replacement": "▁", ...}
        ]
    }
}

This configuration specifies a sequence of pre-tokenization steps that should be applied to the text before it's tokenized. Implementing such configurations in C++ would require parsing this JSON structure and applying each step accordingly, ensuring compatibility with the Hugging Face Tokenizers library while maintaining performance within llama.cpp. For a clearer example of how these pre-tokenization steps might be implemented in a single file of JavaScript code, you can refer to the source of transformers.js's tokenizer implementation: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js |
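As a very rough illustration of what the C++ side could look like, here is a sketch that reads the pre_tokenizer section with the nlohmann::json single header already bundled with llama.cpp's server example; the struct and function names (pre_tok_step, load_pre_tokenizer) are made up for this example, not an existing API:

// Illustrative only: load the pre_tokenizer section of a tokenizer JSON file
// and turn it into a flat list of pre-tokenization steps.
#include <fstream>
#include <string>
#include <vector>
#include "json.hpp" // nlohmann::json

struct pre_tok_step {
    std::string type;        // e.g. "WhitespaceSplit", "Metaspace"
    std::string replacement; // only meaningful for Metaspace
};

static std::vector<pre_tok_step> load_pre_tokenizer(const std::string & path) {
    std::ifstream f(path);
    nlohmann::json j;
    f >> j;

    std::vector<pre_tok_step> steps;
    const auto & pt = j.at("pre_tokenizer");
    const auto parse_one = [&](const nlohmann::json & s) {
        steps.push_back({
            s.at("type").get<std::string>(),
            s.value("replacement", std::string())
        });
    };
    if (pt.at("type") == "Sequence") {
        for (const auto & s : pt.at("pretokenizers")) {
            parse_one(s);
        }
    } else {
        parse_one(pt);
    }
    return steps; // each step would then dispatch to a matching C++ pre-tokenizer
}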
The content of the OBJ type is actually a list of all key names of the object. This change includes several improvements and additions to the codebase:
* GGUFWriter:
  * Added a `def add_kv(self, key: str, val: Any) -> None` method: automatically determines the appropriate value type based on `val`.
  * Added a `def add_dict(self, key: str, val: dict) -> None` method: adds an object (dict) key-value pair.
* constants:
  * Revised `GGUFValueType.get_type(val)`: added support for NumPy integers and floating-point numbers, and selects the appropriate width according to the size of the integer.
* gguf_reader:
  * Added a `ReaderField.get()` method: gets the value of this ReaderField.
* Unit tests have been added to cover these changes.
Related Issues: ggerganov#4868, ggerganov#2872
Hi folks! I went ahead and updated the existing BERT implementations to work with GGML master. See here: https://github.com/iamlemec/bert.cpp/. This is forked from @xyzhang626's embeddings.cpp and consequently @skeskinen's bert.cpp. Mostly graph- and allocation-related changes, and the recent 4d copy commits ended up being necessary for it to work. I've tested CPU and CUDA with various quantization levels and run some preliminary benchmarks, but I don't have access to Metal. I don't want to be too presumptuous, as I'm not the original author, but it would seem to me that including this in llama.cpp would be worthwhile. |
Nice! Might add to the readme that you need to |
For fun and speed, I tried adding Metal support in CMakeLists.txt. But:

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: error: Error Domain=NSCocoaErrorDomain Code=260 "The file “ggml-metal.metal” couldn’t be opened because there is no such file." UserInfo={NSFilePath=ggml-metal.metal, NSUnderlyingError=0x12ebbb020 {Error Domain=NSPOSIXErrorDomain Code=2 "No such file or directory"}}
bert_load_from_file: ggml_backend_metal_init() failed
bert_load_from_file: using CPU backend

So I tried like this:

❯ GGML_METAL_PATH_RESOURCES=ggml/src/ build/bin/main -m models/bge-base-en-v1.5/ggml-model-f16.gguf -p "Hello world"
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = ggml/src/
ggml_metal_init: loading 'ggml/src/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 209.09 MiB, ( 210.72 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, ( 210.73 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 1109.16 MiB, ( 1319.88 / 49152.00)
101 -> [CLS]
7592 -> hello
2088 -> world
102 -> [SEP]
ggml_metal_graph_compute_block_invoke: error: unsupported op 'REPEAT'
GGML_ASSERT: /Users/steve/Code/bert.cpp/ggml/src/ggml-metal.m:760: !"unsupported op"
[1] 28616 abort GGML_METAL_PATH_RESOURCES=ggml/src/ build/bin/main -m -p "Hello world"

I'm looking forward to this work getting integrated into a PR for llama.cpp so embedding (which is a real practical use case, and ideal for client-side indexing) gets more attention! |
@sroussey Thanks for the feedback! Just added a note about that to the README. Agreed on the last point, and in that case we could use the more well-developed tokenization infrastructure from llama.cpp as well. |
@iamlemec Some other things: |
Regarding this error, the CPU, CUDA and Metal backends support full broadcasting in the ggml_add operation. |
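Concretely, the change would look something like this (a sketch of the idea, not the exact bert.cpp code):

// KQ        : [n_kv, n_tokens, n_head, n_batch]
// attn_mask : [n_kv, n_tokens, 1,      n_batch] (or 1 in the batch dim as well)
// before: the broadcast is materialized explicitly, which needs the REPEAT op
//         that the Metal backend does not implement
//   KQ = ggml_add(ctx0, KQ, ggml_repeat(ctx0, attn_mask, KQ));
// after: ggml_add broadcasts the smaller mask across the head dimension implicitly
KQ = ggml_add(ctx0, KQ, attn_mask);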
@sroussey For any testing and benchmarking, I usually just use the Python interface, which should have ~zero overhead. But yeah, the 32 was to test batching, and I was just printing the first 8 to make sure they didn't differ from the baseline. Now I've changed that.

@slaren Yeah, seems like it. I guess I could keep the attention mask as a 4d tensor rather than combining the head and batch dimensions as it does currently. Then adding it to the KQ matrix will broadcast implicitly. |
@sroussey Ok, I believe Metal should work now. |
Metal works now! Embedding two words is actually slower, but I'm not that surprised. I'll need to do some testing later with real data. BTW: I made a PR to fix a segfault, and another to have all the messages go to stderr except the array itself, so it can be piped around. |
You can replace these with a single ggml_soft_max_ext call:

// before
KQ = ggml_scale_inplace(ctx0, KQ, 1.0f / sqrt((float)d_head));
KQ = ggml_add(ctx0, KQ, attn_mask);
KQ = ggml_soft_max(ctx0, KQ);
// after
KQ = ggml_soft_max_ext(ctx0, KQ, attn_mask, 1.0f / sqrt((float)d_head));

Edit: nvm, it's okay the way it is (iamlemec/bert.cpp#3) |
Which repo do we plan to eventually integrate BERT into - is it still llama.cpp? I would like to port Nomic Embed to GGML so GPT4All can use it locally (its architecture is very similar to BERT), and also drop our non-GPU-accelerated, non-batching variant of bert.cpp in the process. But I'm not sure how best to contribute. Also @iamlemec - could you shed light on why this normalization of the output was added?

inpL = ggml_rms_norm(ctx0, inpL, layer_norm_eps); // [E, B]
inpL = ggml_scale_inplace(ctx0, inpL, 1.0f / sqrt((float)n_embd)); // [E, B] (since rms_norm does mean instead of sum)

cc @ggerganov |
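(A side note on why these two ops amount to an L2 normalization, as the reply below confirms; ignoring the epsilon inside rms_norm:

\frac{x_i}{\lVert x \rVert_2}
  = \frac{x_i}{\sqrt{\sum_j x_j^2}}
  = \frac{1}{\sqrt{n}} \cdot \frac{x_i}{\sqrt{\tfrac{1}{n} \sum_j x_j^2}}
  = \frac{\mathrm{rms\_norm}(x)_i}{\sqrt{n}}

with n = n_embd, which is exactly the 1.0f/sqrt((float)n_embd) scale applied after ggml_rms_norm.)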
I would like at some point to have embedding models supported in llama.cpp. Let me know about any specific roadblocks and we can discuss; I've only had a brief look at the fork so far. |
@cebtenzzre That final normalization is just the usual L2-normalization. Looking into it, I realized that most implementations of BERT (like HF) don't normalize in the core model and only do it later at a higher level. So I just added an option for it.

@ggerganov Could have a PR for you later today. Were you thinking of putting it in the core library or in a separate example? |
@iamlemec Here is my rough initial attempt at integrating BERT into core llama.cpp, which I started before I found your fork. Hopefully you find some part of it useful; I have not spent much time comparing it to your implementation, but I know that it is missing some of your improvements: https://github.com/ggerganov/llama.cpp/tree/ceb/bert |
@cebtenzzre Thanks! This is really helpful. Seems like I can build off of this and just add in my changes on top. I think the biggest difference going to BERT-style models is that the attention is not causal. Right now it looks like causal attention is hard-coded in the graph build, so I'm thinking of adding a flag for it. Should that go in llama_hparams or llama_cparams? |
It should be in llama_hparams. llama_cparams is for things the user can change via llama_new_context_with_model. |
It should be integrated in the core library and have support for converting BERT models to GGUF format and loading them as any other model. There could be an accompanying example to demonstrate basic usage.

You can change the mask construction here (llama.cpp, lines 7047 to 7070 at commit 8504d2d).
But we can add a check if it is an embedding model and construct the appropriate mask. This way you wouldn't need an extra flag and in the future if other embedding models are added, they can reuse that mask. Would that work? |
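A rough sketch of that check (hypothetical names such as is_embedding_model, kv_pos and batch_pos, not actual llama.cpp code), building the usual causal mask for decoder models and an all-visible mask for encoder-style embedding models:

#include <cmath> // INFINITY

// f_mask is an [n_kv, n_tokens] float buffer that is added to KQ before the softmax
static void build_kq_mask(float * f_mask, int n_tokens, int n_kv,
                          const int * batch_pos, const int * kv_pos,
                          bool is_embedding_model) {
    for (int i = 0; i < n_tokens; ++i) {          // query position in the batch
        for (int j = 0; j < n_kv; ++j) {          // key position in the KV cache
            const bool visible = is_embedding_model
                ? true                            // bidirectional attention (BERT-style)
                : kv_pos[j] <= batch_pos[i];      // causal attention (decoder LLMs)
            f_mask[i*n_kv + j] = visible ? 0.0f : -INFINITY;
        }
    }
}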
Ok, I think I have it working! Current status is at: https://github.com/iamlemec/llama.cpp. Happy to make it a PR if you think it's ready. I ended up adding a WordPiece tokenizer. As you suggested, I added in a check for embedding models that constructs the appropriate (non-causal) mask.

I was a little unsure about how to get the output correctly. Right now, it looks at the last node, and if it's called "result_embed" it treats it as a pooled embedding. Otherwise, it just executes the old logic (which might include getting a last-token embedding from a proper LLM).

Also, the pooling layer is a bit tricky. Right now it's averaging over the entire batch. But if you had multiple sequences, it would average over all of those, so essentially it will only work properly for single-sequence batches. Is there some way to sum within a sequence? I was thinking you could preconstruct a matrix that maps tokens to their sequences. |
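For what it's worth, the preconstructed-matrix idea could look roughly like this in ggml (a sketch only: inp_pool, seq_id, seq_len and n_seqs are made-up names, and tok_embd is assumed to be the [n_embd, n_tokens] tensor of per-token embeddings):

#include <vector>

// 1) host side: P[t][s] = 1/len(s) if token t belongs to sequence s, else 0
std::vector<float> pool(n_tokens * n_seqs, 0.0f);
for (int t = 0; t < n_tokens; ++t) {
    pool[seq_id[t]*n_tokens + t] = 1.0f / seq_len[seq_id[t]];
}

// 2) graph side: P becomes an input tensor of shape [n_tokens, n_seqs], filled later
//    with ggml_backend_tensor_set(inp_pool, pool.data(), 0, ggml_nbytes(inp_pool))
struct ggml_tensor * inp_pool = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_tokens, n_seqs);

// 3) a single matmul then sums (here: averages) within each sequence:
//    pooled[e, s] = sum_t tok_embd[e, t] * P[t, s]  ->  shape [n_embd, n_seqs]
struct ggml_tensor * pooled = ggml_mul_mat(ctx0,
        ggml_cont(ctx0, ggml_transpose(ctx0, tok_embd)),   // [n_tokens, n_embd]
        inp_pool);                                          // [n_tokens, n_seqs]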
It would be good to open a PR even if it is still a draft, so that we can see the overall changes and give more specific feedback. |
There is a working bert.cpp implementation. We should try to implement this in llama.cpp and update the embedding example to use it. The implementation should follow mostly what we did to integrate Falcon.
Here are the main steps:
* Update gguf.py with BERT arch KV pairs and tensors
* Use gguf.py to generate an F16 model
* Add support for the architecture in llama.cpp
* Add new operators to ggml if needed