[Feature]: Support for multiple embedding types in a single inference call#35829
[Feature]: Support for multiple embedding types in a single inference call#35829noooop merged 20 commits intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
The pull request introduces support for multiple embedding types in a single inference call, which is a significant feature enhancement. The changes involve updating various components to handle a list of pooling tasks instead of a single task. This includes modifications to request mixins, pooling parameters, IO processor logic, and model runner output structures. The test cases have also been updated to cover the new multi-task scenarios.
However, there are critical issues identified in the handling of request queues within the BgeM3SparseEmbeddingsProcessor and a list initialization bug in DispatchPooler that could lead to incorrect behavior or crashes. Additionally, the parameter validation for multiple pooling tasks in PoolingParams needs to be more robust.
tests/plugins/bge_m3_sparse_plugin/bge_m3_sparse_processor/sparse_embeddings_processor.py
Outdated
Show resolved
Hide resolved
tests/plugins/bge_m3_sparse_plugin/bge_m3_sparse_processor/sparse_embeddings_processor.py
Outdated
Show resolved
Hide resolved
tests/plugins/bge_m3_sparse_plugin/bge_m3_sparse_processor/sparse_embeddings_processor.py
Outdated
Show resolved
Hide resolved
|
Hi @staugust, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
There was a problem hiding this comment.
Hmm, at the end of the day we are still having multiple tasks in PoolingParams. So IOProcessor cannot really avoid this complexity. If that's the case then I suggest splling up the PR into two parts: the first part supports multiple tasks in PoolingParams while the second part updates the plugin.
|
@DarkLight1337 ok, shall we update |
|
Sure |
0741f23 to
0dc2998
Compare
|
Hi @staugust, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
I don't think it's worth making such a big change for this feature. I think it would be better to add one or more combination of tasks, rather than supporting multiple tasks. For this specific requirement, we only need to add one task to output the results of embed + token_embed + token_classify. Do we really need to broadly support multiple tasks? |
|
Hi @staugust, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
@noooop I've modify PoolingTask and two related PoolingRequest data class. For sparse&dense embedding, we only need to add 'token_classify+embed'. Anyway, |
|
Hi @noooop, @DarkLight1337 — just wanted to kindly check if you have any updates or feedback on this PR when you have a moment. Thanks! |
|
This pull request has merge conflicts that must be resolved before it can be |
628fe67 to
0b0be64
Compare
|
Hi @staugust, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Head branch was pushed to by a user without write access
b4003e3 to
c19f266
Compare
… call (vllm-project#35829) Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
… call (vllm-project#35829) Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
… call (vllm-project#35829) Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
… call (vllm-project#35829) Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
|
We are very sorry, but we plan to deprecate support for multi-task, (for v2 runner), meaning users will need to select only one of them at the beginning. Please modify this bge-m3 plugin to use only the dense & sparse tasks, meaning that for all requests, the dense & sparse tasks will be computed. When returning results to users, filter the content they need. This may increase overhead, but we have no other choice. |
| self.eos_token_id, | ||
| "token_classify": token_classify_pooler, | ||
| "embed&token_classify": BgeM3Pooler( | ||
| token_classify_pooler, embed_pooler | ||
| ), |
There was a problem hiding this comment.
Sorry, I'm not 100% sure how you implemented support for dense, sparse, and dense&sparse simultaneously, but you need to change it to only use "embed&token_classify": BgeM3Pooler.
@noooop Can a pooling model still support multi pooling tasks by |
|
pooling task is specified by command line |
|
We had some discussions, summaries: Make all supported pooling tasks available simultaneously (multi-task support) vs selecting only one at the start.
|
|
If you have any important information, feel free to add. |
|
@noooop I’d like to understand the reasoning behind v2 runner only supporting a single specific task at model serving instance startup. |
|
This will simplify the design of the v2 runner; otherwise, supporting all pooling tasks running simultaneously (multi-task support) would make the runner complex. |
|
@noooop ok, I'll update this plugin when v2 runner is ready. |
|
It’s because the block v2 feature is being rolled out, so we need to deprecate support for multi-task first. That means you can proceed with implementing this feature now. |
|
ok, I'll issue a pr to update plugin. |
… call (vllm-project#35829) Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
… call (vllm-project#35829) Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
… call (vllm-project#35829) Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
… call (vllm-project#35829) Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Purpose
As issue #35190 proposed, support both sparse&dense embedding in a single inference call.
Test Plan
./tests/plugins_tests/test_bge_m3_sparse_io_processor_plugins.py
Test Result
All test cases passed. Both sparse&dense embedding is ok.
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.