Always use embed&token_classify for bge-m3 #37632
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Code Review
This pull request refactors the bge-m3 sparse embedding processor to consistently use the embed&token_classify task for all pooling requests. This simplifies the logic in merge_pooling_params by removing conditional task assignments. Correspondingly, the post_process method has been updated to always calculate embed_dimensions, ensuring correct slicing of the model's output to extract dense and/or sparse embeddings as requested. The changes align with the stated purpose of supporting a fixed model serving configuration and appear to be implemented correctly.
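The slicing the review describes could look roughly like the sketch below. This is a minimal illustration, not the PR's actual code: the function shape, row layout, and names other than `embed_dimensions` and `embed_task` are assumptions.

```python
# Hypothetical sketch of the post_process slicing described in the review.
# Assumption: each pooled row holds the dense embedding in its first
# `embed_dimensions` values, followed by the per-token sparse weight(s).
def post_process(pooled_rows, embed_dimensions, embed_task):
    """Split a combined embed&token_classify output into dense/sparse parts."""
    dense = [row[:embed_dimensions] for row in pooled_rows]
    sparse = [row[embed_dimensions:] for row in pooled_rows]
    if embed_task == "dense":
        return dense
    if embed_task == "sparse":
        return sparse
    return dense, sparse

rows = [[0.1, 0.2, 0.3, 0.9],
        [0.4, 0.5, 0.6, 0.8]]
print(post_process(rows, 3, "dense"))   # first 3 values of each row
print(post_process(rows, 3, "sparse"))  # trailing value of each row
```

Because the model always runs the combined task, the slicing step is where the requested subset is selected, which is exactly the transfer/compute concern raised below.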
@noooop PTAL, thanks.
If the user only wants to use one type of embeddings, is it necessary for them to use the plugin? If it's necessary, then this would result in some unnecessary tensor transfer, since you always return both dense and sparse embeddings.
@DarkLight1337 Not only unnecessary tensor transfer, but also unnecessary GPU computation for both dense & sparse if the user only wants dense or sparse embeddings. The v2 runner only supports one pooling task at the starting point. Thus, for dense embedding, the user shall call
tests/plugins/bge_m3_sparse_plugin/bge_m3_sparse_processor/sparse_embeddings_processor.py
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
Purpose
As discussed in #35829 (comment), a model serving instance only serves one pooling task set at the starting point. Hence, for `bge-m3` with `dense`, `sparse`, or `dense&sparse` mode, this plugin uses `embed&token_classify` for all pooling requests, and returns `dense`, `sparse`, or `dense&sparse` embeddings according to the `embed_task` configured in the request. This is part of deprecating support for multi-task.
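As a rough illustration of the request side, a client might select the embedding type per request like the sketch below. The payload shape and wrapper keys are assumptions for illustration; only the `embed_task` values come from this PR.

```python
# Hypothetical helper building a pooling request payload. The keys
# "model", "input", and "additional_data" are assumed placements, not
# the plugin's documented schema; only `embed_task` is from the PR.
VALID_EMBED_TASKS = ("dense", "sparse", "dense&sparse")

def make_pooling_request(text, embed_task):
    if embed_task not in VALID_EMBED_TASKS:
        raise ValueError(f"embed_task must be one of {VALID_EMBED_TASKS}")
    return {
        "model": "BAAI/bge-m3",
        "input": text,
        "additional_data": {"embed_task": embed_task},
    }

print(make_pooling_request("hello world", "sparse"))
```

Server-side, every such request maps to the single `embed&token_classify` task, and the processor slices out the requested part(s) of the output.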
Test Plan
plugins_tests/test_bge_m3_sparse_io_processor_plugins.py
Test Result
All tests pass.

Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.