
[Feature]: Support for multiple embedding types in a single inference call #35829

Merged
noooop merged 20 commits into vllm-project:main from staugust:fuse_dense_sparse_embed
Mar 17, 2026

Conversation

@staugust
Contributor

@staugust staugust commented Mar 3, 2026

Purpose

As proposed in issue #35190, support both sparse and dense embeddings in a single inference call.

Test Plan

./tests/plugins_tests/test_bge_m3_sparse_io_processor_plugins.py

Test Result

All test cases passed. Both sparse and dense embeddings work as expected.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces support for multiple embedding types in a single inference call, which is a significant feature enhancement. The changes involve updating various components to handle a list of pooling tasks instead of a single task. This includes modifications to request mixins, pooling parameters, IO processor logic, and model runner output structures. The test cases have also been updated to cover the new multi-task scenarios.

However, there are critical issues identified in the handling of request queues within the BgeM3SparseEmbeddingsProcessor and a list initialization bug in DispatchPooler that could lead to incorrect behavior or crashes. Additionally, the parameter validation for multiple pooling tasks in PoolingParams needs to be more robust.
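
The robustness concern about validating multiple pooling tasks in PoolingParams can be illustrated with a minimal sketch. The class, field, and task names below are hypothetical stand-ins, not vLLM's actual PoolingParams API:

```python
from dataclasses import dataclass, field

# Illustrative set of task names; vLLM's real supported tasks may differ.
SUPPORTED_TASKS = {"embed", "token_embed", "token_classify"}

@dataclass
class PoolingParams:
    # A request may now carry several pooling tasks instead of one.
    tasks: list[str] = field(default_factory=lambda: ["embed"])

    def __post_init__(self) -> None:
        # Reject empty, unknown, and duplicated task lists up front,
        # so malformed requests fail at construction rather than mid-inference.
        if not self.tasks:
            raise ValueError("at least one pooling task is required")
        unknown = set(self.tasks) - SUPPORTED_TASKS
        if unknown:
            raise ValueError(f"unsupported pooling tasks: {sorted(unknown)}")
        if len(set(self.tasks)) != len(self.tasks):
            raise ValueError("duplicate pooling tasks are not allowed")
```

Validating at construction time keeps the scheduler and model runner free of per-task error handling, which is the kind of robustness the review comment asks for.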

@mergify

mergify bot commented Mar 3, 2026

Hi @staugust, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Member

@DarkLight1337 DarkLight1337 left a comment


Hmm, at the end of the day we are still having multiple tasks in PoolingParams. So IOProcessor cannot really avoid this complexity. If that's the case, then I suggest splitting up the PR into two parts: the first part supports multiple tasks in PoolingParams, while the second part updates the plugin.

@staugust
Contributor Author

staugust commented Mar 3, 2026

@DarkLight1337 ok, shall we update the pooling API to support multiple pooling tasks?

@DarkLight1337
Member

Sure

@staugust force-pushed the fuse_dense_sparse_embed branch from 0741f23 to 0dc2998 on March 3, 2026 at 06:48
@mergify

mergify bot commented Mar 3, 2026

Hi @staugust, the pre-commit checks have failed again; the instructions are the same as in the comment above.

@noooop
Collaborator

noooop commented Mar 3, 2026

I don't think it's worth making such a big change for this feature.

I think it would be better to add one or more combination of tasks, rather than supporting multiple tasks.
For example, adding a task : "embed+token_embed+token_classify"

For this specific requirement, we only need to add one task to output the results of embed + token_embed + token_classify. Do we really need to broadly support multiple tasks?
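
noooop's suggestion, registering one combined task rather than generic multi-task support, could be sketched roughly as follows. This is a hedged illustration only: the pooler functions, the plain-list hidden-state representation, and the dispatch dict are all stand-ins, not vLLM's actual pooler implementation:

```python
# Hidden states are modeled as a list of per-token vectors (list[list[float]]).
def embed_pool(h):
    # Dense sentence embedding: mean over tokens.
    dim = len(h[0])
    return [sum(tok[i] for tok in h) / len(h) for i in range(dim)]

def token_embed_pool(h):
    # Multi-vector retrieval: per-token embeddings, passed through as-is.
    return h

def token_classify_pool(h):
    # Sparse retrieval: one score per token (a trivial stand-in scorer).
    return [sum(tok) for tok in h]

# One extra dispatch entry covers the combined task; no generic
# multi-task machinery is needed elsewhere.
POOLERS = {
    "embed": lambda h: {"embed": embed_pool(h)},
    "embed+token_embed+token_classify": lambda h: {
        "embed": embed_pool(h),
        "token_embed": token_embed_pool(h),
        "token_classify": token_classify_pool(h),
    },
}
```

The appeal of this shape is that the scheduler still sees exactly one task name per request; only the pooler's output grows.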

@mergify

mergify bot commented Mar 3, 2026

Hi @staugust, the pre-commit checks have failed again; the instructions are the same as in the comment above.

@staugust
Contributor Author

staugust commented Mar 3, 2026

@noooop I've modified PoolingTask and the two related PoolingRequest data classes. For sparse and dense embedding, we only need to add 'token_classify+embed'. Either way, core.py, scheduler.py, and gpu_model_runner have to be updated to process the two pooling outputs of one request properly. PTAL, thanks.

@staugust
Contributor Author

staugust commented Mar 5, 2026

Hi @noooop, @DarkLight1337 — just wanted to kindly check if you have any updates or feedback on this PR when you have a moment. Thanks!

@mergify

mergify bot commented Mar 5, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @staugust.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@staugust force-pushed the fuse_dense_sparse_embed branch from 628fe67 to 0b0be64 on March 10, 2026 at 07:35
@mergify

mergify bot commented Mar 10, 2026

Hi @staugust, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
auto-merge was automatically disabled March 17, 2026 07:08

Head branch was pushed to by a user without write access

@staugust force-pushed the fuse_dense_sparse_embed branch from b4003e3 to c19f266 on March 17, 2026 at 07:08
@noooop noooop merged commit 9c7cab5 into vllm-project:main Mar 17, 2026
59 checks passed
zhenwei-intel pushed a commit to zhenwei-intel/vllm that referenced this pull request Mar 17, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
andylolu2 pushed a commit to andylolu2/vllm that referenced this pull request Mar 18, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
@noooop
Collaborator

noooop commented Mar 19, 2026

@staugust

We are very sorry, but we plan to deprecate support for multi-task (for the v2 runner), meaning users will need to select only one of them at the beginning.

Please modify this bge-m3 plugin to use only the dense & sparse tasks, meaning that for all requests, both the dense and sparse tasks will be computed. When returning results to users, filter out the content they need.

This may increase overhead, but we have no other choice.
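
The compute-everything-then-filter approach described above can be sketched as follows. The function names and pooling math are purely illustrative, not the plugin's actual code:

```python
# Hidden states are modeled as a list of per-token vectors (list[list[float]]).
def run_pooling(hidden_states):
    # Always compute both outputs for every request, as the comment suggests.
    dim = len(hidden_states[0])
    return {
        "dense": [sum(tok[i] for tok in hidden_states) / len(hidden_states)
                  for i in range(dim)],            # mean-pooled embedding
        "sparse": [max(tok) for tok in hidden_states],  # per-token score stand-in
    }

def respond(hidden_states, requested):
    # Filter the full result down to what the user actually asked for.
    full = run_pooling(hidden_states)
    return {k: v for k, v in full.items() if k in requested}
```

The overhead is that the unrequested output is still computed; the benefit is that the runner only ever sees a single task configuration.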

Comment on lines -236 to 243
self.eos_token_id,
"token_classify": token_classify_pooler,
"embed&token_classify": BgeM3Pooler(
token_classify_pooler, embed_pooler
),
Collaborator


Sorry, I'm not 100% sure how you implemented support for dense, sparse, and dense&sparse simultaneously, but you need to change it to only use "embed&token_classify": BgeM3Pooler.
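
A composite pooler along the lines the reviewer describes might look like this minimal sketch; the real BgeM3Pooler signature and return format may well differ:

```python
# Hypothetical composite pooler: always runs both sub-poolers and returns
# a dict keyed by task name, so a single "embed&token_classify" registration
# can serve both outputs.
class BgeM3Pooler:
    def __init__(self, token_classify_pooler, embed_pooler):
        self.token_classify_pooler = token_classify_pooler
        self.embed_pooler = embed_pooler

    def __call__(self, hidden_states):
        return {
            "embed": self.embed_pooler(hidden_states),
            "token_classify": self.token_classify_pooler(hidden_states),
        }
```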

@staugust
Contributor Author

staugust commented Mar 19, 2026

@staugust

We are very sorry, but we plan to deprecate support for multi-task (for the v2 runner), meaning users will need to select only one of them at the beginning.

Please modify this bge-m3 plugin to use only the dense & sparse tasks, meaning that for all requests, both the dense and sparse tasks will be computed. When returning results to users, filter out the content they need.

This may increase overhead, but we have no other choice.

@noooop Can a pooling model still support multiple pooling tasks via the /pooling API, with a different task configured in each pooling request body? Or is the pooling task specified on the command line?

@noooop
Collaborator

noooop commented Mar 19, 2026

The pooling task is specified on the command line.

@noooop
Collaborator

noooop commented Mar 19, 2026

We had some discussions; here is a summary:

Make all supported pooling tasks available simultaneously (multi-task support) vs selecting only one at the start.

  1. Typically, model training focuses on a single task — 99.9% of usage scenarios only require picking one of them at the beginning.

  2. Having all supported pooling tasks ready at once is certainly fancier and gives users everything they want in one go.

  3. In terms of UX, if users have to pick just one at the start, they'd need to manually select token_embed for multivector retrieval. e.g. jina_embeddings_v4 https://github.com/vllm-project/vllm/blob/main/examples/pooling/token_embed/jina_embeddings_v4_offline.py

  4. BGE-M3 supports three tasks: embed for dense retrieval; token_embed for multi-vector retrieval; and token_classify for sparse retrieval.
    This PR ([Feature]: Support for multiple embedding types in a single inference call #35829) updates the BGE-M3 plugin to support dense, sparse, and dense+sparse modes, making it more convenient for BGE-M3 users.

@noooop
Collaborator

noooop commented Mar 19, 2026

If you have any important information, feel free to add.

@staugust
Contributor Author

@noooop I'd like to understand the reasoning behind the v2 runner only supporting a single specific task at model-serving instance startup.
If this is mainly for user experience, my assumption is that users calling the pooling API already know which pooling tasks their model supports, so it may not be necessary to configure the task via the command line.

@noooop
Collaborator

noooop commented Mar 19, 2026

This will simplify the design of the v2 runner; otherwise, supporting all pooling tasks running simultaneously (multi-task support) would make the runner complex.

@staugust
Contributor Author

@noooop ok, I'll update this plugin when v2 runner is ready.

@noooop
Collaborator

noooop commented Mar 19, 2026

It’s because the block v2 feature is being rolled out, so we need to deprecate support for multi-task first. That means you can proceed with implementing this feature now.

@staugust
Contributor Author

ok, I'll open a PR to update the plugin.

fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>

Labels

frontend, ready, v1

3 participants