Conversation
```cpp
if (!classifier_labels.empty()) {
    hparams.n_cls_out  = classifier_labels.size();
    hparams.n_embd_out = classifier_labels.size();
}
```
This crashes e.g. jina-bert-v2, as observed in CI, because `n_cls_out` defaults to 1 for models that don't have any classifier labels.
thanks @CISC, rolled back to using `n_cls_out` for `pooling = rank` for now
Force-pushed from 894f76d to e8a0336.
CISC left a comment:
LGTM, @ggerganov should look over the `n_embd_out` changes.
convert_hf_to_gguf.py (outdated)
```python
def set_vocab(self):
    super().set_vocab()
    self.gguf_writer.add_add_bos_token(False)
```
Why set this to False BTW? It's true in the config, and it's used in TemplateProcessing both for single and pair.
So, if the config is correct, both `add_add_bos_token` and `add_add_sep_token` should be True, and the sep token should be set to the bos token.
Ideally this would be done automatically by SpecialVocab's TemplateProcessing handling, but IIRC this pattern isn't accepted (a warning is logged); I don't remember exactly why.
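For illustration, the metadata CISC describes might look roughly like this in the converter (a hedged sketch, not code from this PR; the subclass name and the `bos_token_id` lookup are hypothetical, while the `gguf_writer` methods are the ones named in this thread):

```python
class SomeColBertModel(LFM2Model):  # hypothetical subclass, for illustration only
    def set_vocab(self):
        super().set_vocab()
        # match the HF tokenizer config: both BOS and SEP are added...
        self.gguf_writer.add_add_bos_token(True)
        self.gguf_writer.add_add_sep_token(True)
        # ...and the sep token is set to the bos token, as described above
        self.gguf_writer.add_sep_token_id(self.hparams["bos_token_id"])
```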
nice catch, that's a debug leftover (was dealing with double BOS), will remove.
I think you still need to add the sep metadata as mentioned for reranking to work.
ColBERT doesn't need a sep token: it embeds queries and documents separately, and the similarity score is then calculated pairwise using MaxSim. See the script attached here: #18607 (comment).
This allows embedding documents once, caching the document embeddings in a database, and then embedding only the query and computing a similarity score against the precomputed document embeddings.
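For reference, MaxSim scoring over per-token embeddings amounts to something like this (a minimal NumPy sketch, not the attached script; the shapes and per-token L2 normalization are assumptions):

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction score: for each query token, take the similarity
    of its best-matching document token, then sum over query tokens.

    query_tokens: (n_q, n_embd_out), doc_tokens: (n_d, n_embd_out),
    both assumed L2-normalized per token.
    """
    sim = query_tokens @ doc_tokens.T    # (n_q, n_d) cosine similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token
```

Note that the document only enters through the similarity matrix, which is why document embeddings can be computed once and cached.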
I was thinking of adding this logic to llama.cpp, but it would require document database management, so I decided to leave it to the client.
Ah, so TemplateProcessing is not used at all.
ggerganov left a comment:
Should update `examples/embedding` and `examples/model-conversion` to use the new `llama_model_n_embd_out()`.
Thanks for the feedback. Addressed the comment.
```python
from safetensors.torch import load_file

tensors_file = self.dir_model / "1_Dense" / "model.safetensors"
assert tensors_file.is_file()
tensor = load_file(tensors_file)["linear.weight"]
```
Not a change request, but I'm wondering if we should introduce an `extra_model_files()` API in the near future, @CISC @compilade? Something like this:
```python
class LFM2ColBertModel(LFM2Model):
    def extra_model_files(self):
        # tensors will be loaded and processed via `modify_tensors()`
        return [self.dir_model / "1_Dense" / "model.safetensors"]
```
Yep, would be handy to deduplicate some code; I guess we'll see more of this as more sentence-transformers support gets added.
I checked out some other ColBERTv2 models BTW, and they seem to have `linear.weight` embedded in the main file.
danbev left a comment:
I'll take a look at updating the model-conversion example as a separate PR, if that is alright. I'm currently working on #18464 and can either update it there or do it separately.
I can also look at updating llama-embedding, though it looks like it only requires a small update:
```diff
diff --git a/examples/embedding/embedding.cpp b/examples/embedding/embedding.cpp
index 81111e81b..89e626353 100644
--- a/examples/embedding/embedding.cpp
+++ b/examples/embedding/embedding.cpp
@@ -252,7 +252,7 @@ int main(int argc, char ** argv) {
     }

     // allocate output
-    const int n_embd = llama_model_n_embd(model);
+    const int n_embd = llama_model_n_embd_out(model);
     std::vector<float> embeddings(n_embd_count * n_embd, 0);
     float * emb = embeddings.data();
```
@danbev I updated them in this commit 98d7da6, can revert if you prefer to do it in a separate PR.
No, this is great! I just missed that commit. I was able to verify the converted model using the …
For visibility, GGUFs and usage instructions are uploaded to https://huggingface.co/LiquidAI/LFM2-ColBERT-350M-GGUF |
This commit adds a Python script to automatically detect the pooling configuration from a sentence-transformers model directory. The motivation for this change is that I made a mistake when adding the sentence-transformers support: I incorrectly assumed that if an embedding model uses sentence-transformers, it always uses pooling. This assumption no longer holds with the recent addition of support for late interaction models, which can have a down-projection but do not use pooling (like LFM2-ColBERT-350M). This commit builds upon ggml-org#18464, which needs to be merged first. Refs: ggml-org#18607 (comment)
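Such detection could look roughly like this (a minimal sketch assuming the standard sentence-transformers layout, where modules.json lists the pipeline modules; this is not the actual script from the commit):

```python
import json
from pathlib import Path

def uses_pooling(model_dir: Path) -> bool:
    """Return True if the sentence-transformers pipeline declares a Pooling
    module; late-interaction models like LFM2-ColBERT-350M do not."""
    modules = json.loads((model_dir / "modules.json").read_text())
    return any(m["type"].endswith(".Pooling") for m in modules)

# usage: uses_pooling(Path("LFM2-ColBERT-350M"))
```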
This PR adds support for LFM2-ColBERT-350M by introducing `n_embd_out` - a separate output embedding dimension that can differ from the input embedding dimension (`n_embd`).

Initially, I introduced `LLAMA_POOLING_TYPE_TOKEN`, which was applying `cls_out` and outputting all embeddings, but then switched to `n_embd_out`. `n_embd_out` will be used in future multimodal models as well.

New GGUF key and API:

- `LLM_KV_EMBEDDING_LENGTH_OUT` - stores the output embedding dimension
- `llama_model_n_embd_out()` - returns `hparams.n_embd_out` if set and falls back to `hparams.n_embd`

Testing
- Convert the model (a sketch for inspecting the converted file follows these steps)
- Launch the server
- Run the attached Python script rerank.py
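To check that the converted file actually carries the new key, one could do something like this (a hedged sketch using gguf-py's GGUFReader; the serialized key string for `LLM_KV_EMBEDDING_LENGTH_OUT` and the output filename are assumptions):

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("LFM2-ColBERT-350M.gguf")  # hypothetical output filename
for name, field in reader.fields.items():
    # the new key should show up alongside the usual <arch>.embedding_length
    if "embedding_length" in name:
        print(name, field.parts[field.data[-1]])
```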
cc: @ngxson