Conversation

@harryzwh

The original intention was to add support for the Qwen3-Reranker GGUF format in xinference. However, it turned out that llama.cpp's support for Qwen3-Reranker was not yet complete, so bge-reranker-v2-m3 is used for testing for the time being.
More discussion of llama.cpp's Qwen3-Reranker support can be found here, and llama.cpp is also working on supporting Qwen3 rerank, see here.

@codingl2k1

I formatted the code and modified the rerank API to support str/bytes json. Since the CI only uses bge-reranker-v2-m3-Q8_0.gguf, the Makefile only downloads bge-reranker-v2-m3-Q8_0.gguf. I pushed the commits directly to your branch.

@harryzwh
Author

> I formatted the code and modified the rerank API to support str/bytes json. Since the CI only uses bge-reranker-v2-m3-Q8_0.gguf, the Makefile only downloads bge-reranker-v2-m3-Q8_0.gguf. I pushed the commits directly to your branch.

Great! I was going to add JSON string/bytes support, matching what you recently did for the embedding API. Happy to see you already did it.
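For readers following the thread: the pattern discussed above (an API accepting a request body as either a str or bytes JSON payload) can be sketched as follows. This is not xllamacpp's actual API; the `parse_rerank_request` helper and the `query`/`documents` field names are hypothetical, chosen only to illustrate the idea.

```python
import json
from typing import Union


def parse_rerank_request(payload: Union[str, bytes]) -> dict:
    """Accept a rerank request body as either str or bytes JSON.

    json.loads handles both str and UTF-8 encoded bytes directly,
    so one code path covers both input types.
    """
    request = json.loads(payload)
    # Hypothetical field names, for illustration only.
    if "query" not in request or "documents" not in request:
        raise ValueError("rerank request needs 'query' and 'documents'")
    return request


# Both call styles yield the same parsed request:
as_str = '{"query": "what is a panda?", "documents": ["a bear", "a car"]}'
as_bytes = as_str.encode("utf-8")
assert parse_rerank_request(as_str) == parse_rerank_request(as_bytes)
```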

@harryzwh
Author

You may download the rerank model from Hugging Face if the networking issue persists:
https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q2_K.gguf
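As a small aside, fetching the file above amounts to a single HTTP GET against Hugging Face's `resolve` endpoint. A minimal sketch using only the standard library (the `download_model` helper is not part of this repo, just an illustration; `huggingface_hub` or `curl` would work equally well):

```python
from urllib.request import urlretrieve

HF_REPO = "gpustack/bge-reranker-v2-m3-GGUF"
GGUF_FILE = "bge-reranker-v2-m3-Q2_K.gguf"
# Hugging Face serves raw files from the "resolve/<revision>" endpoint.
MODEL_URL = f"https://huggingface.co/{HF_REPO}/resolve/main/{GGUF_FILE}"


def download_model(dest: str = GGUF_FILE) -> str:
    """Download the GGUF file to `dest` and return the local path."""
    path, _headers = urlretrieve(MODEL_URL, dest)
    return path
```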

@harryzwh
Author

Interesting error that only happens on macOS; it seems to be a hardware-related issue, according to this.

@codingl2k1

> Interesting error that only happens on macOS; it seems to be a hardware-related issue, according to this.

I reviewed the server binding, and the cleanup is well handled. There may be some issues caused by llama.cpp. I tested running test_llama_server_multimodal in a dedicated process, and it still crashes at test_llama_server_rerank.
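The isolation experiment described above (running one test in a dedicated process so a native crash cannot take down the main runner) can be sketched like this. The `run_isolated` helper is generic, not code from this repo; the pytest invocation in the comment uses the test names mentioned in the thread with a hypothetical file path:

```python
import subprocess
import sys


def run_isolated(args: list[str]) -> int:
    """Run a command in a child process and report its exit code.

    A segfault in a native extension (e.g. inside llama.cpp) then
    surfaces as a negative return code (the signal number) instead
    of killing the parent process.
    """
    return subprocess.run(args).returncode


# e.g. one test per process so a llama.cpp crash is contained:
# run_isolated([sys.executable, "-m", "pytest",
#               "tests/test_server.py::test_llama_server_rerank"])
```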

@codingl2k1

When I switched to Qwen3-Reranker-0.6B.Q2_K.gguf, I got this error: /home/runner/work/xllamacpp/xllamacpp/thirdparty/llama.cpp/src/llama-graph.cpp:1907: RANK pooling requires either cls+cls_b or cls_out+cls_out_b

related issue: https://huggingface.co/Mungert/Qwen3-Reranker-4B-GGUF/discussions/1

Also, the model bge-reranker-v2-m3-Q2_K.gguf crashes on macOS CI. The rerank feature in llama.cpp is still quite experimental; it is not an issue with our binding. I can skip the test and merge this PR.

@harryzwh
Author

> When I switched to Qwen3-Reranker-0.6B.Q2_K.gguf, I got this error: /home/runner/work/xllamacpp/xllamacpp/thirdparty/llama.cpp/src/llama-graph.cpp:1907: RANK pooling requires either cls+cls_b or cls_out+cls_out_b
>
> related issue: https://huggingface.co/Mungert/Qwen3-Reranker-4B-GGUF/discussions/1
>
> Also, the model bge-reranker-v2-m3-Q2_K.gguf crashes on macOS CI. The rerank feature in llama.cpp is still quite experimental; it is not an issue with our binding. I can skip the test and merge this PR.

Yes, I get the same error on Linux. As mentioned at the very beginning, llama.cpp is still working on Qwen3-Reranker support, which is why bge-reranker-v2-m3 is used for testing. I will keep watching llama.cpp's progress.

@codingl2k1 codingl2k1 merged commit 6a69587 into xorbitsai:main Sep 14, 2025
4 checks passed
@harryzwh harryzwh deleted the rerank branch September 14, 2025 16:46